Using Data Mining Techniques in the Real Estate Industry

Author: Madalina-Alina Racovita, first-year master's student in Computational Optimization at the Faculty of Computer Science, UAIC, Iasi, Romania


Motivation

The way in which the value of a house is currently set in the real estate industry is not necessarily statistically robust. A person can take one of two roles in the real estate market: seller or buyer. Regardless of the role a given person occupies, it is a near certainty that, by reasoning subjectively, she may undervalue or overvalue a property.

Given the hundreds of property sales that occur in any given period of time, it makes sense to evaluate and analyze those sales from multiple points of view in order to establish a more accurate value for a house with a specific set of characteristics. Regression analysis is a statistical approach to modeling the relationship between one or more independent (explanatory) variables, such as the characteristics of a house, and a dependent variable, such as the value or selling price of the house. Existing models have shown that regression analysis is a viable way to better estimate the true value of a house. Classification can also help in making distinctions between different types of properties.
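As a minimal sketch of this idea (with made-up feature values and prices, not data from this notebook), ordinary least squares can relate a few house characteristics to a sale price:

```python
import numpy as np

# Hypothetical training data: [living sqft, bedrooms] -> sale price.
X = np.array([[800.0, 2], [1200, 3], [1500, 3], [2000, 4]])
y = np.array([100_000.0, 150_000, 185_000, 250_000])

# Add an intercept column and fit ordinary least squares.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Estimate the value of an unseen 1,300 sqft, 3-bedroom house.
predicted = np.array([1.0, 1300, 3]) @ coef
print(round(predicted))
```

With real data, the feature matrix would of course come from the assessor columns loaded below rather than from toy values.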

Introduction

This notebook carries out the exploratory data analysis, feature selection and feature engineering required for building a regression / classification model that can predict the value of a dependent variable (for instance, the sale price of a house in the case of regression, or the type of the property in a classification context) as accurately as possible by minimizing a cost (error) function.
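One common cost function for such a regression task is the root mean squared error (RMSE); a minimal sketch with made-up observed and predicted prices:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Two hypothetical sales, each mispredicted by 10,000.
print(rmse([100_000, 150_000], [110_000, 140_000]))  # 10000.0
```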

Import dependencies & environment configuration

In [1]:
!pip install missingno
!pip install folium
!pip install ipympl
!pip install plotly
# !pip install projection-pursuit==0.4
In [2]:
import pandas as pd
import os
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import randomcolor
import missingno as msno 
import folium
import warnings
import random
import plotly.express as px
import umap

from folium.plugins import HeatMap
from scipy.stats import kurtosis
from scipy.stats import skew
from mpl_toolkits import mplot3d
from mpl_toolkits.mplot3d import Axes3D 
from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from numpy import linalg as LA
from scipy import stats

warnings.filterwarnings('ignore')
matplotlib.style.use('ggplot')

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

Load dataframes

In [3]:
os.listdir('./Data')
Out[3]:
['RCON_12011.assessor.tsv',
 'RCON_53033.assessor.tsv',
 'RSFR_12011.assessor.tsv',
 'RSFR_53033.assessor.tsv']

RCON dataframes loading

In [4]:
import pandas as pd
from scipy.stats import skew
In [5]:
df_rcon1 = pd.read_csv("./Data/RCON_12011.assessor.tsv", sep = "\t")
In [6]:
df_rcon1.shape
Out[6]:
(21637, 63)
In [7]:
df_rcon2 = pd.read_csv("./Data/RCON_53033.assessor.tsv", sep = "\t")
In [8]:
df_rcon2.shape
Out[8]:
(11265, 63)

We are going to merge these two dataframes concerning RCON real estate for the purpose of an overall analysis.

In [9]:
df_rcon = pd.concat([df_rcon1, df_rcon2])
df_rcon.shape
Out[9]:
(32902, 63)
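Note that pd.concat keeps each frame's original row labels by default, which is why df_rcon.info() later reports an Int64Index running 0 to 11264 despite having 32902 entries: the labels of the second frame repeat those of the first. If a unique index is needed, ignore_index=True renumbers the rows. A small sketch of the difference:

```python
import pandas as pd

# Two small frames standing in for df_rcon1 and df_rcon2.
a = pd.DataFrame({"LivingSqft": [1180, 800]})
b = pd.DataFrame({"LivingSqft": [971, 1859]})

merged = pd.concat([a, b])                        # keeps labels: 0, 1, 0, 1
reindexed = pd.concat([a, b], ignore_index=True)  # renumbers: 0, 1, 2, 3

print(merged.index.tolist())     # [0, 1, 0, 1]
print(reindexed.index.tolist())  # [0, 1, 2, 3]
```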
In [10]:
print("Number of observations in RCON dataset: ", df_rcon.shape[0])
print("Number of predictors in RCON dataset: ", df_rcon.shape[1] - 1)
Number of observations in RCON dataset:  32902
Number of predictors in RCON dataset:  62

RSFR (single family residential) dataframes loading

In [11]:
df_rsfr1 = pd.read_csv("./Data/RSFR_12011.assessor.tsv", sep = "\t")
In [12]:
df_rsfr1.shape
Out[12]:
(32838, 63)
In [13]:
df_rsfr2 = pd.read_csv("./Data/RSFR_53033.assessor.tsv", sep = "\t")
In [14]:
df_rsfr2.shape
Out[14]:
(53041, 63)

As with the RCON dataframe, the same step is performed for the RSFR real estate dataframe: the df_rsfr1 and df_rsfr2 dataframes are merged.

In [15]:
df_rsfr = pd.concat([df_rsfr1, df_rsfr2])
In [16]:
print("Number of observations in RSFR dataset: ", df_rsfr.shape[0])
print("Number of predictors in RSFR dataset: ", df_rsfr.shape[1] - 1)
Number of observations in RSFR dataset:  85879
Number of predictors in RSFR dataset:  62

Columns description

  • CountyFipsCode = five-digit Federal Information Processing Standards code which uniquely identifies counties and county equivalents in the United States
  • BuildingCode = ...
  • StructureCode = ...
  • StructureNbr = ...
  • LandSqft = square footage of the lot
  • LivingSqft = square footage of the living space
  • GarageSqft = square footage of the garage
  • BasementSqft = square footage of the basement
  • BasementFinishedSqft = square footage of the finished part of the basement (the part which does not need any renovation)
  • AtticSqft = square footage of the attic
  • Bedrooms = number of bedrooms
  • TotalRooms = total number of rooms
  • TotalBaths = number of bathrooms
  • FirePlaces = number of fireplaces
  • YearBuilt = the year in which the house was built
  • EffectiveYearBuilt = ...
  • Condition = English string literal describing property condition (from assessor record)
  • ConditionCode = Numeric type for [Condition]
  • Quality = English string literal describing property build quality (from assessor record)
  • QualityCode = Numeric type for [Quality]
  • GarageCarportCode = ...
  • GarageNoOfCars = the number of cars that can be parked into the garage
  • HasPatioPorch = ...
  • PatioPorchCode = ...
  • HasPool = boolean, indicates whether the house has a pool
  • PoolCode = numeric type for [HasPool]
  • Zonning = per-county description of lot restrictions (from assessor record); a text literal that is county dependent. It can be sparse and refers to restrictions such as ‘R 9600’ (can build no more than 9,600 sq. ft.)
  • LandValue = Estimated value for lot (from assessor record)
  • ImprovementValue = Estimated value for building (from assessor record)
  • TotalValue = Total value, usually sum from Land and Improvement values (from assessor record)
  • AssessedYear = The year of assessment (from assessor record)
  • PropTaxAmount = Tax paid (from assessor record)
  • City = the city where the real estate can be found
  • State = the state where the real estate can be found
  • Zip = zip code
  • Latitude = latitude coordinate for the house
  • Longitude = longitude coordinate for the house
  • BuildingShapeCode = ...
  • ConstructionCode = ...
  • Stories = Number of stories (from assessor record)
  • UnitsInBuilding = Number of units in building (from assessor record). Usually 1 for ‘RSFR’
  • FoundationCode = ...
  • ExteriorCode = ...
  • RoofCode = ...
  • CoolingCode = ...
  • HeatingCode = ...
  • HeatingSourceCode = ...
  • IsWaterfront = ...
  • View = numerical variable
  • ViewScore = ...
  • LastSaleDate = ...
  • LastSalePrice = ...
  • DocType = ...
  • DeedType = ...
  • TransType = ...
  • ArmsLengthFlag = ...
  • DistressCode = Distress code that details why [DistrsdProp] is set to 1.
  • StatusDate = ...
  • SellDate = the date of the real estate's selling
  • SellPrice = the price of the house
  • OwnerOccupied = Set to 0 if suspected to be rented
  • DistrsdProp = Set to 1 if the property is in financial “distress” (behind on mortgage payments, in the process of being foreclosed by the bank, etc.)
  • IsFixer = Set to 1 if the MLS agent reported the property “in need of work” (bad state)
In [17]:
df_rcon.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 32902 entries, 0 to 11264
Data columns (total 63 columns):
CountyFipsCode          32902 non-null int64
BuildingCode            11256 non-null float64
StructureCode           3 non-null float64
StructureNbr            32902 non-null int64
LandSqft                32902 non-null int64
LivingSqft              32902 non-null int64
GarageSqft              32902 non-null int64
BasementSqft            32902 non-null int64
BasementFinishedSqft    32902 non-null int64
AtticSqft               32902 non-null int64
Bedrooms                32902 non-null int64
TotalRooms              32902 non-null int64
TotalBaths              32902 non-null float64
FirePlaces              32902 non-null int64
YearBuilt               32902 non-null int64
EffectiveYearBuilt      32902 non-null int64
Condition               32902 non-null object
ConditionCode           32902 non-null int64
Quality                 32902 non-null object
QualityCode             32902 non-null int64
GarageCarportCode       8525 non-null object
GarageNoOfCars          32902 non-null int64
HasPatioPorch           32902 non-null bool
PatioPorchCode          522 non-null float64
HasPool                 32902 non-null bool
PoolCode                21 non-null object
Zonning                 14316 non-null object
LandValue               32902 non-null int64
ImprovementValue        32902 non-null int64
TotalValue              32902 non-null int64
AssessedYear            32902 non-null int64
PropTaxAmount           32893 non-null float64
City                    0 non-null float64
State                   32902 non-null object
Zip                     32902 non-null int64
Latitude                32902 non-null float64
Longitude               32902 non-null float64
BuildingShapeCode       0 non-null float64
ConstructionCode        32902 non-null int64
Stories                 32902 non-null int64
UnitsInBuilding         32902 non-null int64
FoundationCode          27 non-null float64
ExteriorCode            57 non-null float64
RoofCode                0 non-null float64
CoolingCode             42 non-null float64
HeatingCode             1415 non-null object
HeatingSourceCode       1404 non-null float64
IsWaterfront            32902 non-null bool
View                    9879 non-null float64
ViewScore               32902 non-null int64
LastSaleDate            32902 non-null object
LastSalePrice           32902 non-null int64
DocType                 22592 non-null object
DeedType                0 non-null float64
TransType               27958 non-null object
ArmsLengthFlag          32902 non-null bool
DistressCode            2453 non-null object
StatusDate              32902 non-null object
SellDate                32902 non-null object
SellPrice               21985 non-null float64
OwnerOccupied           32902 non-null bool
DistrsdProp             32902 non-null int64
IsFixer                 32902 non-null float64
dtypes: bool(5), float64(18), int64(27), object(13)
memory usage: 15.0+ MB

Overview of the RCON and RSFR datasets

In [18]:
df_rcon.head()
Out[18]:
CountyFipsCode BuildingCode StructureCode StructureNbr LandSqft LivingSqft GarageSqft BasementSqft BasementFinishedSqft AtticSqft Bedrooms TotalRooms TotalBaths FirePlaces YearBuilt EffectiveYearBuilt Condition ConditionCode Quality QualityCode GarageCarportCode GarageNoOfCars HasPatioPorch PatioPorchCode HasPool PoolCode Zonning LandValue ImprovementValue TotalValue AssessedYear PropTaxAmount City State Zip Latitude Longitude BuildingShapeCode ConstructionCode Stories UnitsInBuilding FoundationCode ExteriorCode RoofCode CoolingCode HeatingCode HeatingSourceCode IsWaterfront View ViewScore LastSaleDate LastSalePrice DocType DeedType TransType ArmsLengthFlag DistressCode StatusDate SellDate SellPrice OwnerOccupied DistrsdProp IsFixer
0 12011 NaN NaN 1 3999 1180 0 0 0 0 2 0 2.0 0 1977 0 AVE 2 QAV 6 NaN 0 False NaN False NaN NaN 6059 54499 60560 2016 1616.0 NaN FL 33321 26.206 -80.265 NaN 1 0 1 NaN NaN NaN NaN NaN NaN False NaN 0 1998-03-27 00:00:00 40000 NaN NaN R True NaN 2017-02-10 00:00:00 1998-03-27 40000.0 False 0 0.0
1 12011 NaN NaN 1 3999 800 0 0 0 0 1 0 1.5 0 1973 0 AVE 2 QAV 6 NaN 0 False NaN False NaN NaN 5079 45759 50840 2016 1560.0 NaN FL 33319 26.171 -80.231 NaN 1 0 1 NaN NaN NaN NaN NaN NaN False NaN 0 2006-10-06 00:00:00 100000 W NaN R True NaN 2017-02-10 00:00:00 2006-10-06 100000.0 False 0 0.0
2 12011 NaN NaN 1 3999 825 0 0 0 0 2 0 1.0 0 1968 0 AVE 2 QAV 6 NaN 0 False NaN False NaN RM-18 7439 66979 74420 2016 420.0 NaN FL 33020 26.018 -80.155 NaN 1 0 1 NaN NaN NaN NaN NaN NaN False NaN 0 2003-12-05 00:00:00 78000 G NaN R True NaN 2017-02-10 00:00:00 2003-12-05 78000.0 True 0 0.0
3 12011 NaN NaN 1 3999 750 0 0 0 0 1 0 1.5 0 1989 0 AVE 2 QAV 6 NaN 0 False NaN False NaN NaN 4929 44329 49260 2016 1300.0 NaN FL 33063 26.263 -80.232 NaN 1 0 1 NaN NaN NaN NaN NaN NaN False NaN 0 2006-11-28 00:00:00 111500 W NaN R True NaN 2017-02-10 00:00:00 2006-11-28 111500.0 True 0 0.0
4 12011 NaN NaN 1 3999 1250 0 0 0 0 2 0 2.0 0 1988 0 AVE 2 QAV 6 NaN 0 False NaN False NaN R-4C 13959 125669 139630 2016 880.0 NaN FL 33442 26.297 -80.158 NaN 1 0 1 NaN NaN NaN NaN NaN NaN False NaN 0 2009-03-04 00:00:00 85500 G NaN R True S 2017-02-10 00:00:00 2009-03-04 NaN True 3 0.0
In [19]:
set(df_rcon['TotalRooms'])
Out[19]:
{0}
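The output above shows that TotalRooms takes only the value 0 in df_rcon, so it carries no predictive information. As a sketch on a toy frame (not the full dataset), such constant columns can be found with nunique and dropped:

```python
import pandas as pd

# Toy frame mimicking the situation: TotalRooms is constant, Bedrooms is not.
df = pd.DataFrame({"TotalRooms": [0, 0, 0], "Bedrooms": [2, 1, 3]})

# Columns with a single unique value are candidates for dropping.
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
df_reduced = df.drop(columns=constant_cols)

print(constant_cols)     # ['TotalRooms']
print(list(df_reduced))  # ['Bedrooms']
```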
In [20]:
df_rcon.describe()
Out[20]:
CountyFipsCode BuildingCode StructureCode StructureNbr LandSqft LivingSqft GarageSqft BasementSqft BasementFinishedSqft AtticSqft Bedrooms TotalRooms TotalBaths FirePlaces YearBuilt EffectiveYearBuilt ConditionCode QualityCode GarageNoOfCars PatioPorchCode LandValue ImprovementValue TotalValue AssessedYear PropTaxAmount City Zip Latitude Longitude BuildingShapeCode ConstructionCode Stories UnitsInBuilding FoundationCode ExteriorCode RoofCode CoolingCode HeatingSourceCode View ViewScore LastSalePrice DeedType SellPrice DistrsdProp IsFixer
count 32902.000000 11256.000000 3.0 32902.000000 3.290200e+04 32902.000000 32902.000000 32902.000000 32902.000000 32902.0 32902.000000 32902.0 32902.000000 32902.000000 32902.000000 32902.0 32902.000000 32902.000000 32902.0 522.000000 3.290200e+04 3.290200e+04 3.290200e+04 32902.000000 32893.000000 0.0 32902.000000 32902.000000 32902.000000 0.0 32902.000000 32902.000000 32902.000000 27.0 57.000000 0.0 42.000000 1404.000000 9879.000000 32902.000000 3.290200e+04 0.0 2.198500e+04 32902.000000 32902.0
mean 26056.128868 3.052239 -1.0 0.999757 6.147412e+04 1133.080633 11.961188 16.207920 7.072883 0.0 1.880311 0.0 1.825581 0.320862 1982.841621 0.0 2.050058 6.381770 0.0 1.994253 4.117705e+04 1.811306e+05 2.223094e+05 2015.992067 2576.931475 NaN 55379.678409 33.490786 -94.593287 NaN 0.733420 1.235153 21.854750 8.0 5.368421 NaN 10.857143 2.269943 9.952121 0.765729 3.194584e+05 NaN 2.070887e+05 0.228527 0.0
std 19465.486063 3.453175 0.0 0.023389 1.769297e+05 3144.911885 66.242595 115.874763 58.321785 0.0 0.708913 0.0 0.645229 0.466815 12.478862 0.0 0.232634 1.252693 0.0 0.531147 4.574657e+05 9.042000e+05 1.259713e+06 0.188168 2960.705532 NaN 30798.861244 10.170521 19.956632 NaN 0.528476 3.023849 46.507234 0.0 1.576972 NaN 2.646410 0.689340 4.297152 1.070396 4.340805e+06 NaN 2.169017e+05 0.882176 0.0
min 12011.000000 1.000000 -1.0 0.000000 3.999000e+03 99.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 1950.000000 0.0 1.000000 0.000000 0.0 1.000000 0.000000e+00 1.700000e+01 1.000000e+03 2010.000000 0.000000 NaN 0.000000 25.962000 -122.518000 NaN 0.000000 0.000000 0.000000 8.0 0.000000 NaN 3.000000 1.000000 2.000000 0.000000 0.000000e+00 NaN 2.250000e+03 0.000000 0.0
25% 12011.000000 1.000000 -1.0 1.000000 3.999000e+03 840.000000 0.000000 0.000000 0.000000 0.0 1.000000 0.0 1.500000 0.000000 1974.000000 0.0 2.000000 6.000000 0.0 2.000000 7.559000e+03 6.576900e+04 7.457500e+04 2016.000000 935.000000 NaN 33065.000000 26.144000 -122.190000 NaN 0.000000 0.000000 1.000000 8.0 5.000000 NaN 12.000000 2.000000 12.000000 0.000000 6.000000e+04 NaN 8.250000e+04 0.000000 0.0
50% 12011.000000 1.000000 -1.0 1.000000 3.999000e+03 1040.000000 0.000000 0.000000 0.000000 0.0 2.000000 0.0 2.000000 0.000000 1980.000000 0.0 2.000000 6.000000 0.0 2.000000 1.502000e+04 1.163090e+05 1.384000e+05 2016.000000 1889.000000 NaN 33319.000000 26.233000 -80.257000 NaN 1.000000 0.000000 1.000000 8.0 5.000000 NaN 12.000000 2.000000 12.000000 0.000000 1.250000e+05 NaN 1.599000e+05 0.000000 0.0
75% 53033.000000 8.000000 -1.0 1.000000 1.535500e+04 1270.000000 0.000000 0.000000 0.000000 0.0 2.000000 0.0 2.000000 1.000000 1991.000000 0.0 2.000000 6.000000 0.0 2.000000 4.388900e+04 2.166590e+05 2.700000e+05 2016.000000 3291.000000 NaN 98032.000000 47.544000 -80.150000 NaN 1.000000 2.000000 12.000000 8.0 7.000000 NaN 12.000000 2.000000 12.000000 2.000000 2.300000e+05 NaN 2.650000e+05 0.000000 0.0
max 53033.000000 9.000000 -1.0 3.000000 3.236574e+06 408116.000000 1120.000000 3470.000000 2730.000000 0.0 9.000000 0.0 7.000000 1.000000 2016.000000 0.0 5.000000 9.000000 0.0 5.000000 7.019320e+07 1.047947e+08 1.421090e+08 2016.000000 97836.000000 NaN 98354.000000 47.859000 -80.076000 NaN 3.000000 43.000000 402.000000 8.0 7.000000 NaN 12.000000 6.000000 15.000000 4.000000 6.750000e+08 NaN 9.159830e+06 5.000000 0.0
In [21]:
print("Number of numerical predictors RCON: ", len(list(df_rcon.describe())))
Number of numerical predictors RCON:  45
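describe() only covers columns that already have a numeric dtype, so boolean and object columns (e.g. HasPool, State) are not included in the count above. The same count can be made explicit with select_dtypes; a small sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({
    "LivingSqft": [1180, 800],         # int64 -> numeric
    "SellPrice": [40000.0, 100000.0],  # float64 -> numeric
    "State": ["FL", "FL"],             # object -> not numeric
    "HasPool": [False, True],          # bool -> not counted as numeric
})

numeric_cols = df.select_dtypes(include="number").columns
print(len(numeric_cols))  # 2
```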
In [22]:
df_rsfr.head()
Out[22]:
CountyFipsCode BuildingCode StructureCode StructureNbr LandSqft LivingSqft GarageSqft BasementSqft BasementFinishedSqft AtticSqft Bedrooms TotalRooms TotalBaths FirePlaces YearBuilt EffectiveYearBuilt Condition ConditionCode Quality QualityCode GarageCarportCode GarageNoOfCars HasPatioPorch PatioPorchCode HasPool PoolCode Zonning LandValue ImprovementValue TotalValue AssessedYear PropTaxAmount City State Zip Latitude Longitude BuildingShapeCode ConstructionCode Stories UnitsInBuilding FoundationCode ExteriorCode RoofCode CoolingCode HeatingCode HeatingSourceCode IsWaterfront View ViewScore LastSaleDate LastSalePrice DocType DeedType TransType ArmsLengthFlag DistressCode StatusDate SellDate SellPrice OwnerOccupied DistrsdProp IsFixer
0 12011 NaN NaN 1 5250 1825 0 0 0 0 3 0 2.0 0 1989 0 AVE 3 QAV 6 NaN 0 True 5.0 True Y PRD-5Q 36749 208159 244910 2016 3262.0 NaN FL 33322 26.149 -80.282 NaN 1 1 1 8.0 5.0 NaN 12.0 NaN NaN False NaN 1 2004-10-06 00:00:00 294000 W NaN R True O 2017-04-15 11:48:00 2004-10-06 294000.0 True 1 0.0
1 12011 NaN NaN 1 7817 971 0 0 0 0 0 0 0.0 0 1958 0 AVE 3 QAV 6 NaN 0 False NaN False NaN RS-5 23449 50479 73930 2016 1794.0 NaN FL 33068 26.208 -80.213 NaN 1 1 1 8.0 5.0 NaN 12.0 NaN NaN False NaN 0 2008-12-09 00:00:00 81000 G NaN R True S 2017-02-10 00:00:00 2008-12-09 81000.0 False 3 0.0
2 12011 NaN NaN 1 5927 1859 0 0 0 0 3 0 2.0 0 1991 0 AVE 3 QAV 6 NaN 0 False NaN False NaN PUD 40009 232599 272610 2016 3112.0 NaN FL 33323 26.134 -80.316 NaN 1 1 1 8.0 5.0 NaN 12.0 NaN NaN False NaN 0 2006-08-28 00:00:00 375000 W NaN R True NaN 2017-02-10 00:00:00 2006-08-28 375000.0 True 0 0.0
3 12011 NaN NaN 1 7053 1540 0 0 0 0 4 0 2.0 0 1960 0 AVE 3 QAV 6 NaN 0 False NaN False NaN R-1C 42319 148759 191080 2016 1774.0 NaN FL 33023 26.009 -80.214 NaN 1 1 1 8.0 5.0 NaN 12.0 NaN NaN False NaN 0 2004-03-25 00:00:00 193000 G NaN R True NaN 2017-02-10 00:00:00 2004-03-25 193000.0 True 0 0.0
4 12011 NaN NaN 1 7931 1862 0 0 0 0 4 0 2.0 0 1977 0 AVE 3 QAV 6 NaN 0 False NaN False NaN RS-5 45599 170939 216540 2016 2161.0 NaN FL 33322 26.152 -80.297 NaN 1 1 1 8.0 5.0 NaN 12.0 NaN NaN False NaN 0 1998-06-29 00:00:00 113000 NaN NaN R True NaN 2017-02-10 00:00:00 1998-06-29 NaN True 0 0.0
In [23]:
df_rsfr.describe()
Out[23]:
CountyFipsCode BuildingCode StructureCode StructureNbr LandSqft LivingSqft GarageSqft BasementSqft BasementFinishedSqft AtticSqft Bedrooms TotalRooms TotalBaths FirePlaces YearBuilt EffectiveYearBuilt ConditionCode QualityCode GarageNoOfCars PatioPorchCode LandValue ImprovementValue TotalValue AssessedYear PropTaxAmount City Zip Latitude Longitude BuildingShapeCode ConstructionCode Stories UnitsInBuilding FoundationCode ExteriorCode RoofCode CoolingCode HeatingSourceCode View ViewScore LastSalePrice DeedType SellPrice DistrsdProp IsFixer
count 85879.00000 52780.000000 1.0 85879.000000 8.587900e+04 85879.000000 85879.000000 85879.000000 85879.000000 85879.0 85879.000000 85879.0 85879.000000 85879.000000 85879.000000 85879.0 85879.000000 85879.000000 85879.0 34458.000000 8.587900e+04 8.587900e+04 8.587900e+04 85879.000000 85854.000000 0.0 85879.000000 85879.000000 85879.000000 0.0 85879.000000 85879.000000 85879.000000 32697.000000 32832.000000 1.0 26054.000000 52771.000000 12803.000000 85879.000000 8.587900e+04 0.0 3.963200e+04 85879.000000 85879.0
mean 37347.20445 1.022717 -1.0 1.010655 1.622282e+04 1874.517647 211.161623 355.455431 179.267341 0.0 2.827292 0.0 1.977003 0.535474 1975.227867 0.0 3.211914 6.523073 0.0 2.770097 1.850358e+05 2.680594e+05 4.530971e+05 2015.996542 5160.540091 NaN 73126.300947 39.357506 -106.169047 NaN 0.378812 1.386183 0.956031 8.231397 5.188048 24.0 11.994934 2.467757 4.705069 0.693336 2.938055e+05 NaN 3.825581e+05 0.244833 0.0
std 19935.47468 0.416729 NaN 0.133182 6.083578e+04 1029.014108 281.659989 653.931705 385.812408 0.0 1.460753 0.0 1.210666 0.498743 18.869927 0.0 0.485548 1.243311 0.0 1.322711 2.843866e+05 4.280534e+05 6.377382e+05 0.121223 4094.548236 NaN 31643.595395 10.409928 20.398196 NaN 0.487082 0.510940 0.347963 1.718927 1.963427 NaN 0.196633 0.753664 3.734096 1.041690 4.413295e+06 NaN 3.645303e+05 0.914076 0.0
min 12011.00000 1.000000 -1.0 0.000000 3.999000e+03 99.000000 0.000000 0.000000 0.000000 0.0 0.000000 0.0 0.000000 0.000000 1950.000000 0.0 1.000000 0.000000 0.0 1.000000 9.900000e+01 4.990000e+02 1.100000e+03 2010.000000 0.000000 NaN 0.000000 25.957000 -122.526000 NaN 0.000000 0.000000 0.000000 8.000000 5.000000 24.0 3.000000 1.000000 2.000000 0.000000 0.000000e+00 NaN 2.667000e+03 0.000000 0.0
25% 12011.00000 1.000000 -1.0 1.000000 5.859000e+03 1264.000000 0.000000 0.000000 0.000000 0.0 2.000000 0.0 1.000000 0.000000 1957.000000 0.0 3.000000 6.000000 0.0 2.000000 5.579400e+04 1.504890e+05 2.400000e+05 2016.000000 3090.000000 NaN 33313.000000 26.178000 -122.284000 NaN 0.000000 1.000000 1.000000 8.000000 5.000000 24.0 12.000000 2.000000 2.000000 0.000000 1.180000e+05 NaN 2.110000e+05 0.000000 0.0
50% 53033.00000 1.000000 -1.0 1.000000 7.741000e+03 1650.000000 0.000000 0.000000 0.000000 0.0 3.000000 0.0 2.000000 1.000000 1976.000000 0.0 3.000000 6.000000 0.0 2.000000 1.179990e+05 2.179990e+05 3.520000e+05 2016.000000 4437.000000 NaN 98027.000000 47.386000 -122.108000 NaN 0.000000 1.000000 1.000000 8.000000 5.000000 24.0 12.000000 2.000000 3.000000 0.000000 2.052000e+05 NaN 3.050000e+05 0.000000 0.0
75% 53033.00000 1.000000 -1.0 1.000000 1.070400e+04 2262.000000 440.000000 670.000000 0.000000 0.0 4.000000 0.0 3.000000 1.000000 1992.000000 0.0 3.000000 8.000000 0.0 5.000000 2.429990e+05 3.099990e+05 5.400000e+05 2016.000000 6222.000000 NaN 98092.000000 47.605000 -80.276000 NaN 1.000000 2.000000 1.000000 8.000000 5.000000 24.0 12.000000 3.000000 8.000000 1.000000 3.394500e+05 NaN 4.475000e+05 0.000000 0.0
max 53033.00000 9.000000 -1.0 4.000000 6.542712e+06 69854.000000 5430.000000 14610.000000 9740.000000 0.0 128.000000 0.0 68.000000 1.000000 2016.000000 0.0 6.000000 10.000000 0.0 5.000000 2.617600e+07 9.881800e+07 1.249940e+08 2016.000000 94182.000000 NaN 98354.000000 47.778000 -80.077000 NaN 3.000000 7.000000 68.000000 21.000000 29.000000 24.0 12.000000 6.000000 15.000000 3.000000 1.280000e+09 NaN 2.200000e+07 5.000000 0.0

Visualizing the missing values

In [24]:
filtered_data = msno.nullity_filter(df_rcon, filter='bottom', p=0.999) # or filter='top'
msno.matrix(filtered_data.sample(300), color=[0.3, 0.3, 0.4])
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aecf4b5470>
In [25]:
filtered_data = msno.nullity_filter(df_rcon, filter='bottom', p=0.999) 
msno.heatmap(filtered_data)
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aed14eaeb8>
In [26]:
def get_columns_with_missing_values():
    columns_with_missing_values = []
    for colname in df_rcon.columns:
        # Vectorized null counts instead of list comprehensions over booleans
        missing_values_current_column_rcon = df_rcon[colname].isnull().sum()
        missing_values_current_column_rsfr = df_rsfr[colname].isnull().sum()
        
        if missing_values_current_column_rcon != 0:
            columns_with_missing_values.append((colname + "_RCON", missing_values_current_column_rcon, 
                                              str(missing_values_current_column_rcon / df_rcon.shape[0] * 100) + "%"))
        
        if missing_values_current_column_rsfr != 0:
            columns_with_missing_values.append((colname + "_RSFR", missing_values_current_column_rsfr, 
                                              str(missing_values_current_column_rsfr / df_rsfr.shape[0] * 100) + "%"))
            
    return columns_with_missing_values
In [27]:
def highlight(row, colors_dict):
    # Rows come in RCON/RSFR pairs: rows without their own colour reuse
    # the colour assigned to the first row of their pair
    key = int(row.name)
    if key not in colors_dict:
        key -= 1
    return ['background-color: ' + colors_dict[key]] * 3

def random_color():
    return 'rgba' + str(tuple(np.random.randint(256, size=3)) + (0.2,))
In [28]:
missing_values_df = pd.DataFrame(get_columns_with_missing_values(), 
             columns=['Predictor\'s name', 'No. of missing values', 'Percentage of absence'])

random_colors = {idx: random_color() for idx in range(0, len(missing_values_df.index), 2)}
missing_values_df.style.apply(highlight, colors_dict=random_colors, axis=1)
Out[28]:
Predictor's name No. of missing values Percentage of absence
0 BuildingCode_RCON 21646 65.78931371953072%
1 BuildingCode_RSFR 33099 38.54143620675602%
2 StructureCode_RCON 32899 99.99088201325146%
3 StructureCode_RSFR 85878 99.99883557097777%
4 GarageCarportCode_RCON 24377 74.0897209896055%
5 GarageCarportCode_RSFR 48343 56.29199222161413%
6 PatioPorchCode_RCON 32380 98.41347030575649%
7 PatioPorchCode_RSFR 51421 59.87610475203484%
8 PoolCode_RCON 32881 99.93617409276033%
9 PoolCode_RSFR 73837 85.97794571431898%
10 Zonning_RCON 18586 56.488967236034284%
11 Zonning_RSFR 1895 2.20659299712386%
12 PropTaxAmount_RCON 9 0.027353960245577777%
13 PropTaxAmount_RSFR 25 0.029110725555723752%
14 City_RCON 32902 100.0%
15 City_RSFR 85879 100.0%
16 BuildingShapeCode_RCON 32902 100.0%
17 BuildingShapeCode_RSFR 85879 100.0%
18 FoundationCode_RCON 32875 99.91793811926327%
19 FoundationCode_RSFR 53182 61.92666426018002%
20 ExteriorCode_RCON 32845 99.826758251778%
21 ExteriorCode_RSFR 53047 61.769466342179115%
22 RoofCode_RCON 32902 100.0%
23 RoofCode_RSFR 85878 99.99883557097777%
24 CoolingCode_RCON 32860 99.87234818552064%
25 CoolingCode_RSFR 59825 69.66196625484694%
26 HeatingCode_RCON 31487 95.69934958361193%
27 HeatingCode_RSFR 33074 38.51232548120029%
28 HeatingSourceCode_RCON 31498 95.73278220168987%
29 HeatingSourceCode_RSFR 33108 38.55191606795608%
30 View_RCON 23023 69.97446963710414%
31 View_RSFR 73076 85.09181522840275%
32 DocType_RCON 10310 31.33548112576743%
33 DocType_RSFR 41295 48.08509647294449%
34 DeedType_RCON 32902 100.0%
35 DeedType_RSFR 85879 100.0%
36 TransType_RCON 4944 15.026442161570724%
37 TransType_RSFR 17155 19.975779876337636%
38 DistressCode_RCON 30449 92.54452616862197%
39 DistressCode_RSFR 78347 91.22952060457155%
40 SellPrice_RCON 10917 33.18035377788584%
41 SellPrice_RSFR 46247 53.85134899102225%

Removing columns with missing percentage $\ge$ 95%

The columns with a missing percentage above 95% in both the RCON and RSFR datasets are going to be deleted, since features that are almost entirely absent cannot carry much predictive power when building any type of Machine Learning model.

In [29]:
column_miss_perc_ge_95 = ['DeedType', 'RoofCode', 'BuildingShapeCode', 'City', 'StructureCode']
for column in column_miss_perc_ge_95:
    del df_rcon[column]
    del df_rsfr[column]
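As a side note, the same removal can be derived programmatically from a missing-fraction threshold instead of a hard-coded column list. A minimal sketch on a toy frame (the data is illustrative, not taken from the datasets):

```python
import numpy as np
import pandas as pd

# Toy frame: column names echo the real ones, data is illustrative
toy = pd.DataFrame({
    'DeedType': [np.nan] * 9 + ['W'],   # 90% missing -> kept
    'RoofCode': [np.nan] * 10,          # 100% missing -> dropped
    'SellPrice': [1.0] * 10,            # complete -> kept
})

threshold = 0.95
missing_frac = toy.isnull().mean()      # fraction of NaNs per column
to_drop = missing_frac[missing_frac >= threshold].index.tolist()
toy = toy.drop(columns=to_drop)
```

This keeps the notebook reproducible if the datasets are refreshed and a different set of columns crosses the cutoff.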

Replacing the missing values for the rest of the predictors

In [30]:
other_columns = ['BuildingCode', 'GarageCarportCode', 'PatioPorchCode', 'PoolCode', 'Zonning', 'PropTaxAmount', 'FoundationCode', 'ExteriorCode', 'CoolingCode', 'HeatingCode', 
'HeatingSourceCode', 'View', 'DocType', 'TransType', 'DistressCode', 'SellPrice']
df_rsfr[other_columns].dtypes
Out[30]:
BuildingCode         float64
GarageCarportCode     object
PatioPorchCode       float64
PoolCode              object
Zonning               object
PropTaxAmount        float64
FoundationCode       float64
ExteriorCode         float64
CoolingCode          float64
HeatingCode           object
HeatingSourceCode    float64
View                 float64
DocType               object
TransType             object
DistressCode          object
SellPrice            float64
dtype: object
In [31]:
object_miss_values_features = []
numeric_miss_valuez_features = []
for colname, col_type in df_rsfr[other_columns].dtypes.items():
    if col_type == object:
        object_miss_values_features.append(colname)
    else:
        numeric_miss_valuez_features.append(colname)
In [32]:
for column in object_miss_values_features:
    print(column + "  ___________  " + str(set(df_rsfr[column].dropna())))
    print()
GarageCarportCode  ___________  {'C 0', 'GA0', 'G 0', 'GD0', 'GB0'}

PoolCode  ___________  {'Y'}

Zonning  ___________  {'NC2P40', 'RM-8', 'RS7.2', 'R-4A', 'R1P', 'MB', 'AP', 'RS9600', 'R-4', 'SVV', 'RM-10(7.5)', 'R-3', 'MF2L', 'RAC-ND2', 'RS-9', 'BG', 'RM48', 'NC2P65', 'EH', 'MRG', 'MPD', 'NBDD-DZ', 'PLA 6E', 'RD-6L', 'RMH-60', 'BP', 'LDR', 'PLA 3C', 'E-1', 'RS 11', 'PLA 6D', 'RS-2B', 'RM-45/HR', 'CBSO', 'PD', 'RS-L', 'R-2', 'TSQ', 'RS-3D', 'RE', 'C140', 'RM-30', 'RS7200', 'NC240', 'RS-4K', 'R-1', 'C2P40', 'SFS', 'R-1-C', 'R', 'CA', 'RS 7200', 'CD', 'R-1A', 'MHO', 'MDR', 'RS-3', 'BCE-1', 'NRH 3', 'HCB', 'RS 5', 'R7.5', 'IB U65', 'RA10', 'RM-5', 'RSX 7.2', 'RS-3F', 'MRD', 'MUO', 'R5', 'RPUD', 'RS-4J', 'RM3600', 'MFM', 'UL7200', 'ROC', 'TC4', 'RML', 'RS-1A', 'R4C', 'RM-10N', 'NB', 'AE-2', 'ENSZD', 'BCRS-3', 'RS', 'R3.5', 'R12.5', 'SR30', 'RS-10', 'DUC', 'AG', 'SFSL', 'IDM 7585150', 'A-1', 'RDS-15', 'RM-12C', 'SR8', 'NC365', 'PRD 5.1', 'CBD', 'SGC', 'RM-15', 'RSA 6', 'LR1', 'MRRC', 'R2', 'UNCL', 'CE', 'A10', 'RM-6Y', 'R 4000', 'RS 12.5', 'C130', 'RS 35', 'RS8400', 'RS-5', 'RS-4.4', 'RM-18', 'CCMU', 'RMH-25', 'RM-45', 'B-5', 'RSR', 'PRD', 'RA2.5P', 'SR3', 'SF 5000', 'T', 'UM2400', 'CC2', 'R-1B', 'RA5SO', 'R30', 'PECD', 'DCE', 'AI1', 'CBP', 'ROA', 'R6C', 'RO', 'C-2', 'RM-16C', 'MDR8', 'RC-12', 'PRD-11.6Q', 'O', 'EP1', 'RCM', 'R8PSO', 'PRD-6.3', 'RM2400', 'RS-6', 'RA3600', 'RS-3K', 'RS35', 'PRD-25Q', 'R 8400', 'RA10SO', 'RS-1', 'RA2.5', 'R6', 'PRD-4', 'R3', 'RAC-UV', 'URPSO', 'UH900', 'P', 'TC C', 'R6P', 'R1.8', 'PRD-4Q', 'M1C', 'PLA 17', 'RS-5J', 'MC', 'R-3A', 'RMA 3.6', 'AE-1', 'MRM', 'RAC-RPO', 'SFD', 'RSX 35', 'RM-10C', 'RD-1', 'URP', 'RA5', 'UL15000', 'RS 20000', 'CN', 'RM-10Q', 'RSE', 'R14', 'UL9600', 'EP2', 'PRC', 'IG2 U65', 'TC B', 'MUR45', 'RS-3C', 'RZ', 'PRD-5.8', 'RD-8', 'DT', 'RM-13R', 'NC340', 'RM1800', 'RS-5G', 'R8P', 'PRD-10Q', 'RM-10(5)', 'RSX 8.5', 'SR1', 'RS-5I', 'A-E', 'RM900', 'R4PSO', 'PRD-5Q', 'BO', 'M1', 'RS 7.2', 'RS-4C', 'DBR', 'RM-5C', 'R-5', 'RAC-ND4', 'RM 900', 'BRMO', 'RA10PSO', 'MRT16', 'RS-4', 'RS 9600', 'C2', 'SF 7200', 'RAC-SMU', 'R 
7200', 'A10P', 'RR', 'MUR70', 'R12', 'R9.6', 'MIT', 'RM-12', 'RM-16', 'RS-5K', 'RS12000', 'R4', 'RS-3G', 'RS-3K S', 'RSLTC', 'RC-5', 'B', 'CC', 'RMA 1.8', 'NC130', 'RS-2K', 'RM 3.6', 'PBZ', 'R4.5', 'RS 8.5', 'MU', 'NC', 'RM-25', 'R-6', 'PLA 16', 'RIN SINGLE F', 'MUR', 'PRD-15Q', 'GC', 'RS15000', 'R8', 'UR', 'R15', 'EP', 'PRD-3.4', 'RS5', 'SR4.5', 'RC', 'MIO37LR3', 'PRD-3.8', 'RS 6', 'BN', 'R6SO', 'PUD-5', 'JBD 6', 'R1SO', 'SFR 10', 'MR', 'LR3 RC', 'RS-5H', 'LR2', 'PDDC', 'B-3', 'DC', 'POSPF', 'PRD-16Q', 'RS4000', 'PDD', 'NC265', 'NC140', 'B-2A', 'R40', 'RM', 'CR', 'R10', 'NBP', 'RM 3600', 'SF 9600', 'CLI', 'R4P', 'RD-10', 'RS-2', 'TC', 'MUR35', 'RM24', 'R1', 'NBRA', 'O-1', 'TC3', 'NC3P40', 'RMCRA-76', 'M-3', 'RS 10000', 'MFH', 'PRD-5', 'RSX 5', 'PR', 'R-1C', 'HDR', 'RH 8', 'CM2', 'CB', 'R48', 'RM18', 'RA5P', 'R6PSO', 'R 2800, OP, L', 'RSA 1', 'CF-H', 'C-4', 'RS-1EP', 'CBC', 'IG2 U85', 'R 2800, OP', 'LR3', 'C230', 'RA10DPA', 'RS15', 'PUD', 'RMF', 'NC3P85', 'RM12', 'R 40000', 'HC', 'RM-10', 'RM-13C', 'SPI-3', 'C165', 'C240', 'GCMU', 'RC-15', 'UVEV', 'URSO', 'US R1', 'R18P', 'MF-1', 'R16', 'A10SO', 'C3', 'PRD-3Q', 'R-4C', 'RAC-CC', 'R20', 'R 9600, SSHO', 'F', 'R-8', 'UT1', 'C-1', 'IM-1', 'CC1', 'R-1B&C', 'RS 6.3', 'RAC-TC', 'RM-20', 'UV', 'I', 'R-3U', 'R 2800', 'R 5400D', 'RM 5', 'SFE', 'B-1', 'NC2-65', 'RM-9', 'R2.5', 'OS2', 'RCS-15', 'NC2P30', 'RA2.5SO', 'TND', 'RSA 8', 'MSC 1', 'WD III', 'WD II', 'PRD-14.5Q', 'RS-4G', 'RS-5C', 'PLA 7B', 'IB U45', 'R8.4', 'RL', 'BR2', 'C1', 'RS 15000', 'WD I', 'ABC', 'RSA 4', 'RMCRA-18', 'PURD', 'RD-15', 'RS9.6', 'LR2 RC', 'RS-7', 'R24', 'MIO37LR2', 'TC A1', 'R7.2', 'CI', 'A35', 'SR6', 'PRR', 'M2', 'R 9600', 'RS-5K S', 'UH1800', 'B-3A', 'RS-8', 'C-3', 'R20A', 'B-2', 'TC D', 'R18', 'BCRM-5', 'RS-2.5', 'R-1D', 'UHUCR', 'RC-6', 'UM3600', 'GDC', 'R7', 'R4SO'}

HeatingCode  ___________  {'11', '3', '1', '5', '9', '12', '4', 'Y', '2', '13', '20'}

DocType  ___________  {'T', 'Q', 'U', 'W', 'G'}

TransType  ___________  {'S', 'T', 'C', 'R', 'U', 'L'}

DistressCode  ___________  {'I', 'A', 'S', 'G', 'F', 'U', 'L', 'O'}

The missing values for the object-type columns are going to be replaced with empty strings. For the numeric variables, the missing values are going to be replaced with 0, except for SellPrice, where a missing value is going to be replaced with the LastSalePrice. Even though the remaining variables are numerical, they appear to be discrete, since they are codes representing categories.
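The replacement plan above (empty string for object columns, 0 for numeric codes, LastSalePrice as a fallback for SellPrice) can be sketched on a toy frame with illustrative values:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'DocType': ['W', np.nan, 'G'],
    'CoolingCode': [1.0, np.nan, 3.0],
    'SellPrice': [np.nan, 200000.0, 250000.0],
    'LastSalePrice': [150000.0, 180000.0, 240000.0],
})

toy['DocType'] = toy['DocType'].fillna('')           # object column -> empty string
toy['CoolingCode'] = toy['CoolingCode'].fillna(0)    # numeric code -> 0
# SellPrice -> fall back to LastSalePrice where missing
toy['SellPrice'] = toy['SellPrice'].fillna(toy['LastSalePrice'])
```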

In [33]:
numeric_miss_valuez_features
Out[33]:
['BuildingCode',
 'PatioPorchCode',
 'PropTaxAmount',
 'FoundationCode',
 'ExteriorCode',
 'CoolingCode',
 'HeatingSourceCode',
 'View',
 'SellPrice']
In [34]:
for column in object_miss_values_features:
    df_rcon[column] = df_rcon[column].fillna('')
    df_rsfr[column] = df_rsfr[column].fillna('')
In [35]:
rcon_medium_sellprice = df_rcon['SellPrice'].mean()
rsfr_medium_sellprice = df_rsfr['SellPrice'].mean()
print("The mean price for an RCON real estate is = ", rcon_medium_sellprice)
print("The mean price for an RSFR real estate is = ", rsfr_medium_sellprice)
The mean price for an RCON real estate is =  207088.6848305663
The mean price for an RSFR real estate is =  382558.09684093663
In [36]:
for column in numeric_miss_valuez_features:
    if column != 'SellPrice':
        df_rcon[column] = df_rcon[column].fillna(0)
        df_rsfr[column] = df_rsfr[column].fillna(0)

Merging RCON and RSFR dataframes

Both dataframes are going to be concatenated into a bigger dataframe, together with a new column called isRCON. Its value is going to be 0 if the real estate comes from the RSFR dataframe, and 1 if the house comes from the RCON dataframe.

In [37]:
df_rcon['isRCON'] = 1
df_rsfr['isRCON'] = 0
df = pd.concat([df_rcon, df_rsfr])
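One caveat worth keeping in mind: by default `pd.concat` keeps the original row labels, so the merged frame ends up with duplicate indices. Passing `ignore_index=True` rebuilds a clean 0..n-1 index. A minimal sketch on two stand-in frames:

```python
import pandas as pd

# Two stand-in frames playing the role of df_rcon / df_rsfr
a = pd.DataFrame({'SellPrice': [1.0, 2.0]})
b = pd.DataFrame({'SellPrice': [3.0]})
a['isRCON'] = 1
b['isRCON'] = 0

# ignore_index=True rebuilds a clean sequential index for the merged frame
merged = pd.concat([a, b], ignore_index=True)
```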
In [38]:
prices = []
for (index, row) in df.iterrows():
    sell_price = row['SellPrice']
    # Fall back to LastSalePrice when SellPrice is missing
    if pd.isnull(sell_price):
        sell_price = row['LastSalePrice']
    # A zero or still-missing price is replaced with the per-dataset mean
    if sell_price == 0 or pd.isnull(sell_price):
        if row['isRCON'] == 1:
            prices.append(rcon_medium_sellprice)
        else:
            prices.append(rsfr_medium_sellprice)
    else:
        prices.append(sell_price)

df['SellPrice'] = prices
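The row-by-row loop above can also be written in vectorized form, which avoids the `iterrows` overhead on ~119k rows. A sketch assuming the same fallback order (LastSalePrice where SellPrice is missing, then the per-dataset mean where the price is zero); the mean values are stand-ins:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'SellPrice': [np.nan, 0.0, 120000.0, 0.0],
    'LastSalePrice': [90000.0, 0.0, 110000.0, 0.0],
    'isRCON': [1, 1, 0, 0],
})
rcon_mean, rsfr_mean = 207088.68, 382558.10   # stand-ins for the computed means

# 1) missing SellPrice -> LastSalePrice; 2) zero price -> per-dataset mean
price = toy['SellPrice'].fillna(toy['LastSalePrice'])
fallback = np.where(toy['isRCON'] == 1, rcon_mean, rsfr_mean)
toy['SellPrice'] = price.mask(price == 0, fallback)
```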
In [39]:
rcon_medium_sellprice
Out[39]:
207088.6848305663
In [40]:
df.loc[(df['isRCON'] == 1) & (df['LastSalePrice'] == 0), 'LastSalePrice'] = rcon_medium_sellprice
df.loc[(df['isRCON'] == 0) & (df['LastSalePrice'] == 0), 'LastSalePrice'] = rsfr_medium_sellprice

Non-numerical columns

In [41]:
bool_type_columns = df.select_dtypes(include=bool).columns.tolist()
bool_type_columns
Out[41]:
['HasPatioPorch', 'HasPool', 'IsWaterfront', 'ArmsLengthFlag', 'OwnerOccupied']
In [42]:
for column in bool_type_columns:
    df[column] = df[column].astype(int)
In [43]:
print("Number of numerical predictors: ", len(list(df.describe())))
Number of numerical predictors:  46
In [44]:
char_type_columns = df.select_dtypes(include=object).columns.tolist()
char_type_columns
Out[44]:
['Condition',
 'Quality',
 'GarageCarportCode',
 'PoolCode',
 'Zonning',
 'State',
 'HeatingCode',
 'LastSaleDate',
 'DocType',
 'TransType',
 'DistressCode',
 'StatusDate',
 'SellDate']
In [45]:
df[df.select_dtypes(include=object).columns.tolist()].head()
Out[45]:
Condition Quality GarageCarportCode PoolCode Zonning State HeatingCode LastSaleDate DocType TransType DistressCode StatusDate SellDate
0 AVE QAV FL 1998-03-27 00:00:00 R 2017-02-10 00:00:00 1998-03-27
1 AVE QAV FL 2006-10-06 00:00:00 W R 2017-02-10 00:00:00 2006-10-06
2 AVE QAV RM-18 FL 2003-12-05 00:00:00 G R 2017-02-10 00:00:00 2003-12-05
3 AVE QAV FL 2006-11-28 00:00:00 W R 2017-02-10 00:00:00 2006-11-28
4 AVE QAV R-4C FL 2009-03-04 00:00:00 G R S 2017-02-10 00:00:00 2009-03-04

Constant features detection

In [46]:
def highlight_constant_features(row):
    if row['Number of different values RCON'] == 1 and row['Number of different values RSFR'] == 1:
        return ['background-color: ' + 'rgba(255, 102, 102, 0.6)']*3
    else:
        return ['background-color: ' + 'rgba(255, 255, 255, 0.18)']*3

columns = list(df_rcon)
value_counts = pd.DataFrame({"Predictor\'s Name": columns, 
              "Number of different values RCON": [len(list(df_rcon[column].value_counts())) for column in columns],
              "Number of different values RSFR": [len(list(df_rsfr[column].value_counts())) for column in columns]})
value_counts.style.apply(highlight_constant_features, axis=1)
Out[46]:
Predictor's Name Number of different values RCON Number of different values RSFR
0 CountyFipsCode 2 2
1 BuildingCode 5 4
2 StructureNbr 4 5
3 LandSqft 2047 17719
4 LivingSqft 2022 4344
5 GarageSqft 105 510
6 BasementSqft 178 664
7 BasementFinishedSqft 123 437
8 AtticSqft 1 1
9 Bedrooms 8 16
10 TotalRooms 1 1
11 TotalBaths 17 35
12 FirePlaces 2 2
13 YearBuilt 62 67
14 EffectiveYearBuilt 1 1
15 Condition 5 5
16 ConditionCode 5 6
17 Quality 5 7
18 QualityCode 7 10
19 GarageCarportCode 18 6
20 GarageNoOfCars 1 1
21 HasPatioPorch 2 2
22 PatioPorchCode 4 4
23 HasPool 2 2
24 PoolCode 2 2
25 Zonning 324 452
26 LandValue 6426 12636
27 ImprovementValue 11514 22945
28 TotalValue 8696 25328
29 AssessedYear 4 6
30 PropTaxAmount 7027 13403
31 State 2 2
32 Zip 125 134
33 Latitude 825 990
34 Longitude 719 1221
35 ConstructionCode 4 4
36 Stories 30 7
37 UnitsInBuilding 185 11
38 FoundationCode 2 3
39 ExteriorCode 3 7
40 CoolingCode 4 4
41 HeatingCode 11 12
42 HeatingSourceCode 6 6
43 IsWaterfront 2 2
44 View 10 8
45 ViewScore 5 4
46 LastSaleDate 5368 7377
47 LastSalePrice 4507 10549
48 DocType 6 6
49 TransType 7 7
50 ArmsLengthFlag 2 2
51 DistressCode 9 9
52 StatusDate 1844 3364
53 SellDate 5497 7656
54 SellPrice 3708 6530
55 OwnerOccupied 2 2
56 DistrsdProp 5 5
57 IsFixer 1 1
58 isRCON 1 1
In [47]:
columns_constant_features = ['AtticSqft', 'IsFixer', 'GarageNoOfCars', 'EffectiveYearBuilt', 'TotalRooms']
for current in columns_constant_features:
    del df[current]
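The constant columns can also be detected programmatically with `nunique`, instead of being read off the table by hand. A minimal sketch on a toy frame (data is illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    'IsFixer': [0.0, 0.0, 0.0],     # constant
    'AtticSqft': [0.0, 0.0, 0.0],   # constant
    'Bedrooms': [2, 3, 4],          # varies
})

# A column with a single distinct value carries no information
constant_columns = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
toy = toy.drop(columns=constant_columns)
```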

Grouping the predictors semantically

Features that are semantically similar are going to be grouped into the same category, in a global dictionary, so that their analysis is easier to organize.

In [48]:
features_grouped_dict = {
    'code_related': ['CountyFipsCode',
                     'BuildingCode',
                     'ConditionCode',
                     'QualityCode',
                     'GarageCarportCode',
                     'PatioPorchCode',
                     'PoolCode',
                     'ConstructionCode',
                     'FoundationCode',
                     'ExteriorCode',
                     'CoolingCode',
                     'HeatingCode',
                     'HeatingSourceCode',
                     'DistressCode',
                     'Condition'],
    'dimension_related': ['LandSqft',
                          'LivingSqft',
                          'GarageSqft', 
                          'BasementSqft', 
                          'BasementFinishedSqft'],
    'number_of': ['StructureNbr',
                  'Bedrooms', 
                  'TotalBaths', 
                  'FirePlaces',
                  'Stories',
                  'UnitsInBuilding',
                 ],
    'bool_type': ['HasPatioPorch', 
                  'HasPool', 
                  'IsWaterfront', 
                  'ArmsLengthFlag', 
                  'OwnerOccupied'],
    'quality_related': ['Quality',
                        'View'
                        ],
    'location_related': ['Zip',
                         'Latitude',
                         'Longitude',
                         'State',
                         'Zonning'],
    'time_related': ['StatusDate',
                     'LastSaleDate',
                     'YearBuilt',
                     'AssessedYear',
                     'SellDate'],
    'type_related': ['DocType', 
                     'TransType'],
    'price_related_regression': ['SellPrice',
                                 'LastSalePrice',
                                 'LandValue', 
                                 'ImprovementValue', 
                                 'TotalValue',
                                 'PropTaxAmount'],
    'classification_related': ['isRCON']
}
In [49]:
features_grouped_dict['price_related_regression']
Out[49]:
['SellPrice',
 'LastSalePrice',
 'LandValue',
 'ImprovementValue',
 'TotalValue',
 'PropTaxAmount']
In [50]:
df[features_grouped_dict['price_related_regression']].head()
Out[50]:
SellPrice LastSalePrice LandValue ImprovementValue TotalValue PropTaxAmount
0 40000.0 40000.0 6059 54499 60560 1616.0
1 100000.0 100000.0 5079 45759 50840 1560.0
2 78000.0 78000.0 7439 66979 74420 420.0
3 111500.0 111500.0 4929 44329 49260 1300.0
4 85500.0 85500.0 13959 125669 139630 880.0
In [51]:
columns = list(features_grouped_dict['price_related_regression'])
value_counts = pd.DataFrame({"Predictor\'s Name": columns, 
              "Number of different values RCON": [len(list(df_rcon[column].value_counts())) for column in columns],
              "Number of different values RSFR": [len(list(df_rsfr[column].value_counts())) for column in columns]})
value_counts.style.apply(highlight_constant_features, axis=1)
Out[51]:
Predictor's Name Number of different values RCON Number of different values RSFR
0 SellPrice 3708 6530
1 LastSalePrice 4507 10549
2 LandValue 6426 12636
3 ImprovementValue 11514 22945
4 TotalValue 8696 25328
5 PropTaxAmount 7027 13403

Central tendency details

In [52]:
df['SellPrice'].mean()
Out[52]:
308642.4401809113
In [53]:
df['SellPrice'].median()
Out[53]:
207088.6848305663
In [54]:
df['SellPrice'].mode()
Out[54]:
0    382558.096841
dtype: float64

As can be deduced from the above, the most frequent price is 382558$, which is equal to the mean selling price of the RSFR houses; this was expected after replacing the missing values with that mean.
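The effect imputation has on the central-tendency statistics can be reproduced on a toy series: replacing missing entries with the mean of the observed values makes that value the mode.

```python
import pandas as pd

toy = pd.Series([100.0, 200.0, None, None, None, 300.0])
toy = toy.fillna(toy.mean())   # mean of the observed values is 200.0

mode_value = toy.mode()[0]     # the imputed value becomes the mode
```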

Distribution visualization

In [55]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))

ax.hist([list(df['SellPrice']),
          list(df[df['isRCON'] == 1]['SellPrice']), 
          list(df[df['isRCON'] == 0]['SellPrice'])], 
          bins = 25, stacked=True, fill=True, 
          alpha = 0.37, color=['green', 'purple', 'red'])

ax.set_title('Distribution of the houses depending on the selling price')
ax.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax.set_ylabel('Frequency')
ax.set_xlabel('SellPrice')
plt.show()

Since the percentage of missing values for the output variable SellPrice was bigger than 30% for both the RSFR and RCON properties (replaced in a previous preprocessing step with the mean), I am going to redo the stacked histogram to explore the distribution of the original prices in the dataset.

In [56]:
rcon_medium_sellprice
Out[56]:
207088.6848305663
In [57]:
rsfr_medium_sellprice
Out[57]:
382558.09684093663
In [58]:
fig, ax0 = plt.subplots(nrows=1, ncols=1, figsize=(10, 6))
mean_sellprice_rcon = df['SellPrice'].mean()

ax0.hist([list(df[(df['SellPrice'] != rcon_medium_sellprice) & (df['SellPrice'] != rsfr_medium_sellprice)]['SellPrice']),
          list(df[(df['isRCON'] == 1) & (df['SellPrice'] != rcon_medium_sellprice)]['SellPrice']), 
          list(df[(df['isRCON'] == 0) & (df['SellPrice'] != rsfr_medium_sellprice)]['SellPrice'])], 
          bins = 20, fill=True, alpha=0.7,
          linewidth=2, histtype='bar', 
          color=['#ffb366', '#6666ff', '#ff66b3'])

ax0.set_title('Distribution of the houses depending on the selling price \n The houses that had initially missing values were excluded', fontsize=10)
ax0.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax0.set_ylabel('Frequency')
ax0.set_xlabel('SellPrice')
plt.show()

It seems that most of the houses have a sale price smaller than 2,500,000\$, both in the RCON and RSFR dataframes. I am going to verify what fraction of the houses have a price bigger than this amount.

In [59]:
str(len(list(df_rsfr[df_rsfr['SellPrice'] > 2500000]['SellPrice'])) / len(list(df_rsfr['SellPrice'])) * 100) + "% for RSFR"
Out[59]:
'0.15719791800090826% for RSFR'
In [60]:
str(len(list(df_rcon[df_rcon['SellPrice'] > 2500000]['SellPrice'])) / len(list(df_rcon['SellPrice'])) * 100) + "% for RCON"
Out[60]:
'0.048629262658804933% for RCON'
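The share of expensive listings can also be computed directly from a boolean mask; note that `.mean()` on a mask gives a fraction, which must be multiplied by 100 before being labelled a percentage. A sketch on toy prices:

```python
import pandas as pd

toy_prices = pd.Series([100000, 2600000, 300000, 3000000, 450000])

share = (toy_prices > 2500000).mean()   # fraction of rows above the cutoff
percentage = share * 100                # the fraction alone is NOT a percentage
```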

I am going to drop the observations whose price is bigger than 2,500,000$, since few observations satisfy this condition. Both datasets have enough values that some can be dropped, for the purpose of constructing a more robust regression/classification model (the model should not be built on particularities but on generally observable patterns).

In [61]:
df = df.drop(df[df['SellPrice'] > 2500000].index)

I am going to rebuild the histograms to see the distribution of the price among the real estates listed in the samples.

In [62]:
fig, ax0 = plt.subplots(nrows=1, ncols=1, figsize=(15, 7))
mean_sellprice_rcon = df['SellPrice'].mean()

ax0.hist([list(df[(df['SellPrice'] != rcon_medium_sellprice) & (df['SellPrice'] != rsfr_medium_sellprice)]['SellPrice']),
          list(df[(df['isRCON'] == 1) & (df['SellPrice'] != rcon_medium_sellprice)]['SellPrice']), 
          list(df[(df['isRCON'] == 0) & (df['SellPrice'] != rsfr_medium_sellprice)]['SellPrice'])], 
          bins = 20, fill=True, alpha=0.7,
          linewidth=2, histtype='bar', 
          color=['#ffb366', '#6666ff', '#ff66b3'])

ax0.set_title('Distribution of the houses depending on the selling price \n The houses that had initially missing values were excluded', fontsize=10)
ax0.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax0.set_ylabel('Frequency')
ax0.set_xlabel('SellPrice')
plt.show()
In [63]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15, 7))

ax.hist([list(df['SellPrice']),
          list(df[df['isRCON'] == 1]['SellPrice']), 
          list(df[df['isRCON'] == 0]['SellPrice'])], 
          bins = 25, stacked=True, fill=True, 
          alpha = 0.37, color=['green', 'purple', 'red'])

ax.set_title('Distribution of the houses depending on the selling price')
ax.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax.set_ylabel('Frequency')
ax.set_xlabel('SellPrice')
plt.show()
In [64]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(15, 7))
sns.distplot(df['SellPrice'], hist = False, kde = True,
                 kde_kws = {'shade': True, 'linewidth': 3}, 
                 label = 'Distribution RCON + RSFR')
sns.distplot(df[df['isRCON'] == 1]['SellPrice'], hist = False, kde = True,
                 kde_kws = {'shade': True, 'linewidth': 3}, 
                 label = 'Distribution RCON')
sns.distplot(df[df['isRCON'] == 0]['SellPrice'], hist = False, kde = True,
                 kde_kws = {'shade': True, 'linewidth': 3}, 
                 label = 'Distribution RSFR')
plt.title('Density plot SellPrice')
plt.show()

Skewness & kurtosis information


In [65]:
print(skew(df['SellPrice']))
print(skew(df[df['isRCON'] == 1]['SellPrice']))
print(skew(df[df['isRCON'] == 0]['SellPrice']))
3.134642474952049
4.1102041762408845
2.998641432913546


In [66]:
print(kurtosis(df['SellPrice']))
print(kurtosis(df[df['isRCON'] == 1]['SellPrice']))
print(kurtosis(df[df['isRCON'] == 0]['SellPrice']))
17.136540571665492
29.49461674797743
15.647678114540248

All three distributions are leptokurtic, something that can also be noticed from the density plot presented in the previous section.
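As a quick sanity check on the leptokurtic reading: `scipy.stats.kurtosis` returns excess kurtosis (approximately 0 for a Gaussian), so positive values indicate heavier tails than the normal distribution. A sketch on synthetic data:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=100_000)
heavy_tailed = rng.standard_t(df=3, size=100_000)   # Student t(3): heavy tails

normal_excess = kurtosis(normal_sample)   # excess kurtosis, ~0 for a Gaussian
heavy_excess = kurtosis(heavy_tailed)     # clearly positive -> leptokurtic
```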

In [67]:
continuous_variables = features_grouped_dict['price_related_regression'][1:-1]
fig, ax = plt.subplots(2, 2, figsize=(15, 10))

for i in range(len(continuous_variables)):
    plt.subplot(2,2,i+1)
    plt.scatter(df[continuous_variables[i]], df['SellPrice'], c=df['isRCON'], alpha=0.2, marker="1")
    plt.xlabel(continuous_variables[i], fontsize=12)
    plt.ylabel('SellPrice', fontsize=12)
    plt.xlim(-1e02, 1e6)
    
plt.show()
In [68]:
plt.figure(figsize = (15,7))
plt.scatter(df['PropTaxAmount'], df['SellPrice'], c=df['isRCON'], alpha=0.2, marker="1")
plt.xlabel('PropTaxAmount', fontsize=12)
plt.ylabel('SellPrice', fontsize=12)
plt.xlim(-1e02, 7e4)
plt.show()
In [69]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(8, 5))
sns.distplot(df['PropTaxAmount'], hist = False, kde = True, kde_kws = {'shade': True, 'linewidth': 3}, 
                 label = 'Distribution RSFR')
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x2aedae1ad68>
In [70]:
features_grouped_dict['time_related']
Out[70]:
['StatusDate', 'LastSaleDate', 'YearBuilt', 'AssessedYear', 'SellDate']

LastSaleDate vs. SellDate

At first glance, SellDate seems identical to LastSaleDate, except that LastSaleDate has a different format: YYYY-MM-DD HH:mm:ss. The 'HH:mm:ss' part is irrelevant in the context of real estate sales, since only the day-level information matters. I am going to verify the equivalence in the next sections:

In [71]:
df['LastSaleDate'] = df['LastSaleDate'].apply(lambda x: str(x.split()[0]))
df['SellDate'] = df['SellDate'].astype(str)
df['LastSaleDate'].head()
Out[71]:
0    1998-03-27
1    2006-10-06
2    2003-12-05
3    2006-11-28
4    2009-03-04
Name: LastSaleDate, dtype: object
In [72]:
list(df['LastSaleDate']) == list(df['SellDate']) # are these two equivalent??
Out[72]:
False
In [73]:
def see_percentage_of_dif_LastSaleDate_SellDate(df, dataset_name):
    # Vectorized element-wise comparison of the two date columns
    no_differences = (df['LastSaleDate'] != df['SellDate']).sum()
    total = len(df)

    print(str(no_differences / total * 100) + "% " + dataset_name)
    
see_percentage_of_dif_LastSaleDate_SellDate(df[df['isRCON'] == 1], "RCON")
see_percentage_of_dif_LastSaleDate_SellDate(df[df['isRCON'] == 0], "RSFR")
1.5422963078361116% RCON
2.250629112862257% RSFR

Since LastSaleDate and SellDate differ for a small but non-negligible fraction of rows in both datasets, both columns are going to be kept.

In [74]:
df['SellDate_Year'] = df['SellDate'].apply(lambda x: int(x[:4]))
df['SellDate_Month'] = df['SellDate'].apply(lambda x: int(x[5:7]))
df['SellDate_Day'] = df['SellDate'].apply(lambda x: int(x[8:]))
In [75]:
df[['SellDate_Year', 'SellDate_Month', 'SellDate_Day', 'SellDate']].head()
Out[75]:
SellDate_Year SellDate_Month SellDate_Day SellDate
0 1998 3 27 1998-03-27
1 2006 10 6 2006-10-06
2 2003 12 5 2003-12-05
3 2006 11 28 2006-11-28
4 2009 3 4 2009-03-04
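An equivalent, more robust way to split the date is via `pd.to_datetime` and the `.dt` accessor, which also validates the format instead of slicing blindly. A sketch:

```python
import pandas as pd

toy = pd.DataFrame({'SellDate': ['1998-03-27', '2006-10-06', '2003-12-05']})

# Parsing validates the YYYY-MM-DD format; malformed values would raise
dates = pd.to_datetime(toy['SellDate'], format='%Y-%m-%d')
toy['SellDate_Year'] = dates.dt.year
toy['SellDate_Month'] = dates.dt.month
toy['SellDate_Day'] = dates.dt.day
```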
In [76]:
del df['SellDate']
In [77]:
features_grouped_dict['time_related'].append('SellDate_Year')
features_grouped_dict['time_related'].append('SellDate_Month')
features_grouped_dict['time_related'].append('SellDate_Day')
features_grouped_dict['time_related'].remove('SellDate')
features_grouped_dict['time_related']
Out[77]:
['StatusDate',
 'LastSaleDate',
 'YearBuilt',
 'AssessedYear',
 'SellDate_Year',
 'SellDate_Month',
 'SellDate_Day']

StatusDate column

In [78]:
df[['StatusDate', 'LastSaleDate', 'YearBuilt', 'SellDate_Year', 'SellDate_Month']].head(10)
Out[78]:
StatusDate LastSaleDate YearBuilt SellDate_Year SellDate_Month
0 2017-02-10 00:00:00 1998-03-27 1977 1998 3
1 2017-02-10 00:00:00 2006-10-06 1973 2006 10
2 2017-02-10 00:00:00 2003-12-05 1968 2003 12
3 2017-02-10 00:00:00 2006-11-28 1989 2006 11
4 2017-02-10 00:00:00 2009-03-04 1988 2009 3
5 2017-02-10 00:00:00 1995-12-20 1995 1995 12
6 2017-02-10 00:00:00 1979-10-01 1974 1979 10
7 2017-02-10 00:00:00 2005-05-31 1968 2005 5
8 2017-02-10 00:00:00 2007-06-12 1973 2007 6
9 2017-02-10 00:00:00 1999-11-15 1970 1999 11
In [79]:
df['StatusDate'] = df['StatusDate'].apply(lambda x: str(x.split()[0]))
In [80]:
str(len(set(list(df['StatusDate'])))) + " different values for StatusDate"
Out[80]:
'1929 different values for StatusDate'

YearBuilt column

In [81]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))
ax0, ax1, ax2, ax3 = axes.flatten()

ax0.hist([list(df['YearBuilt']),
          list(df[df['isRCON'] == 1]['YearBuilt']), 
          list(df[df['isRCON'] == 0]['YearBuilt'])], 
         bins=10, histtype='bar')

ax0.set_title('Distribution of the houses depending on YearBuilt', fontsize=10)
ax0.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax0.set_ylabel('Frequency')
ax0.set_xlabel('YearBuilt')


# ------------------- SellDate_Day ----------------------------

ax1.hist([list(df['SellDate_Day']),
          list(df[df['isRCON'] == 1]['SellDate_Day']), 
          list(df[df['isRCON'] == 0]['SellDate_Day'])], 
          bins=31, histtype='step', stacked=True, fill=False,
          linewidth=5, alpha=0.8, color=['#96ceb4', '#ff6f69', '#ffeead'])

ax1.set_title('Distribution of the houses depending on the day on which they are sold', fontsize=10)
ax1.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax1.set_ylabel('Frequency')
ax1.set_xlabel('Day of sale')

# ------------------- SellDate_Month ----------------------------

ax2.hist([list(df['SellDate_Month']),
          list(df[df['isRCON'] == 1]['SellDate_Month']), 
          list(df[df['isRCON'] == 0]['SellDate_Month'])], 
          bins=12, alpha=0.4, histtype='bar', stacked=True, 
          color=['#ffcc66', '#0033cc', '#006666'],
          linestyle='dashed', lw=3)

ax2.set_title('Distribution of the houses depending on the month in which they are sold', fontsize=10)
ax2.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax2.set_ylabel('Frequency')
ax2.set_xlabel('Month of sale')

# ------------------- SellDate_Year ----------------------------

ax3.hist([list(df['SellDate_Year']),
          list(df[df['isRCON'] == 1]['SellDate_Year']), 
          list(df[df['isRCON'] == 0]['SellDate_Year'])], 
          bins=15, histtype='bar', color=['#ff9900', '#0033cc', '#00cc99'])

ax3.set_title('Distribution of the houses depending on the year in which they are sold', fontsize=10)
ax3.legend(['Frequency of RCON + RSFR', 'Frequency of RCON', 'Frequency of RSFR'], prop={'size': 10})
ax3.set_ylabel('Frequency')
ax3.set_xlabel('Year of sale')

plt.show()

The distributions of the number of houses sold per year do not appear identical for the RCON and RSFR datasets. For the RCON dataset the distribution looks closer to a Gaussian, whereas for the RSFR dataset it looks closer to a uniform distribution.

In [82]:
fig, ax = plt.subplots(figsize=(17, 5))
sns.boxplot(df['SellDate_Day'], df['SellPrice'])
plt.show()
In [83]:
fig, ax = plt.subplots(figsize=(5, 8))
sns.boxplot(df['SellDate_Month'], df['SellPrice'])
plt.show()

The boxplots above suggest a degree of seasonality in real estate sales: regardless of the day or month, most of the houses sold during a year are valued at less than $500,000.

There are many outliers, but at first glance they appear uniformly distributed across days and months: on any given day or in any given month cheaper properties predominate, yet more expensive houses are sold as well.

Pearson's correlation matrix

Having too many features in a model is not always a good thing: it can cause overfitting and worse results when predicting values for a new dataset. Thus, if a feature does not improve the model much, leaving it out may be the better choice.
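One mechanical way to act on this idea is to flag every feature whose absolute Pearson correlation with an earlier feature exceeds a threshold. A minimal sketch on synthetic data (the column names and the 0.95 cutoff are illustrative, not taken from the dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'b': a * 2 + rng.normal(scale=0.01, size=200),  # near-duplicate of 'a'
    'c': rng.normal(size=200),                      # independent feature
})

corr = toy.corr().abs()
# keep only the strict upper triangle so each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
print(to_drop)  # only the redundant column 'b' is flagged
```

The same loop applied to the houses dataframe would surface pairs such as Zip / CountyFipsCode; whether to actually drop one of them remains a modeling judgment, as discussed below.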

In [84]:
# df_only_numerical = df[df.columns[~df.columns.isin(char_type_columns)]]
matplotlib.style.use('ggplot')
f, ax = plt.subplots(figsize=(14, 14))
plt.imshow(df.corr(), cmap="viridis", interpolation='nearest')
plt.title('Pearson Correlation among features')
plt.colorbar()

tick_marks = [i for i in range(len(df.columns))]
plt.xticks(tick_marks, df.columns, rotation='vertical')
plt.yticks(tick_marks, df.columns)
plt.show()

Another important consideration is correlation. If two features are very highly correlated, keeping both is usually a bad idea, as it can contribute to overfitting. For instance, if overfitting occurs we may remove CountyFipsCode or Zip, since they are highly correlated. This relationship can be anticipated from the feature definitions in the dataset, but the correlation matrix should be checked to confirm it. However, high correlation does not always mean one of the features must be removed. For example, Zip and FirePlaces are also highly correlated, but the relationship between them is unlikely to be of the same nature as the one between Zip and CountyFipsCode.

The correlation matrix shows how correlated the variables are with each other and, more importantly, with the dependent variable, the sale price.

In [85]:
corr = df.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(30, 16))
sns.heatmap(df.corr(), annot=True, fmt=".2f", mask=mask, cmap='magma')
plt.show()

Independence tests: chi2_contingency ; Association between categorical features

The categorical features are going to be filtered by performing chi-squared tests of independence. The strength of association between isRCON and each categorical predictor will be measured with Cramér's V.

For this task the chi-square test of independence is going to be used. It checks whether there is a relationship between two nominal variables. Consider, for instance, two relevant categorical variables, Quality and isRCON: we test whether the quality of the buildings differs depending on the real estate type.

The H0 (Null Hypothesis): There is no relationship between Quality and variable isRCON.

The H1 (Alternative Hypothesis): There is a relationship between Quality and isRCON.

If the p-value is significant (as close to 0 as possible, preferably smaller than 0.05), the null hypothesis can be rejected and the findings support the alternative hypothesis.
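As a sanity check of the mechanics before applying the test to Quality and isRCON, here is a toy contingency table with invented counts in which the two variables are clearly dependent:

```python
from scipy.stats import chi2_contingency

# Invented counts: rows = two quality levels, columns = two property types
table = [[50, 10],
         [10, 50]]

chi2, p, dof, expected = chi2_contingency(table)
print(dof, p < 0.05)  # 1 degree of freedom; H0 is rejected
```

With all four expected counts equal to 30 and the observed counts far from that, the statistic is large and the p-value is far below 0.05, so H0 (independence) is rejected.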

In [86]:
contingency_table = pd.crosstab(df['Quality'], df['isRCON'])
contingency_table
Out[86]:
isRCON 0 1
Quality
000 259 420
QAV 53718 26097
QEX 2325 3064
QFA 6363 194
QGO 21858 2320
QLU 1 0
QPO 119 0
In [87]:
import researchpy as rp

contingency_table, independence_test_results = rp.crosstab(df['Quality'], df['isRCON'], prop='col', test='chi-square')
independence_test_results
Out[87]:
Chi-square test results
0 Pearson Chi-square ( 6.0) = 9727.8648
1 p-value = 0.0000
2 Cramer's V = 0.2887

Cramér's V (sometimes referred to as Cramér's phi and denoted as φc) is a measure of association between two nominal variables.

In [88]:
from scipy.stats import chi2_contingency

def cramers_v(confusion_matrix):
    """ Calculate Cramér's V statistic for categorical-categorical association.
        uses correction from Bergsma and Wicher,
        Journal of the Korean Statistical Society 42 (2013): 323-328
    """
    chi2 = chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))

def get_cramers_v_for_given_features(feature1, feature2):
    confusion_matrix = pd.crosstab(df[feature1], df[feature2]).to_numpy()
    return cramers_v(confusion_matrix)

get_cramers_v_for_given_features('Quality', 'isRCON')
Out[88]:
0.2885828528393876

Independence Tests: T-test for the means

Calculate the T-test for the means of two independent samples of scores.

This is a two-sided test for the null hypothesis that 2 independent samples have identical average (expected) values. This test assumes that the populations have identical variances by default.
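Since the two price samples may well have different variances, Welch's variant of the test (`equal_var=False`), which does not pool the variances, is worth checking as well. A minimal sketch on synthetic samples (the means and scales below are invented):

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.RandomState(42)
sample_a = rng.normal(loc=100, scale=10, size=500)
sample_b = rng.normal(loc=105, scale=25, size=500)  # different mean AND variance

# Welch's t-test: no equal-variance assumption
stat, p = ttest_ind(sample_a, sample_b, equal_var=False)
print(p < 0.05)  # the difference in means is detected
```

For the real price columns, running both variants and comparing the p-values would indicate how sensitive the conclusion is to the equal-variance assumption.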

In [89]:
from scipy.stats import ttest_ind
ttest_ind(df[df['isRCON']==1]['SellPrice'], df[df['isRCON']==0]['SellPrice']) # returns t-statistic and p-value
Out[89]:
Ttest_indResult(statistic=-67.66483133587826, pvalue=0.0)
In [90]:
features_grouped_dict['location_related']
Out[90]:
['Zip', 'Latitude', 'Longitude', 'State', 'Zonning']
In [91]:
df['State'].value_counts()
Out[91]:
WA    63356
FL    53382
Name: State, dtype: int64

Since the real estates covered in the dataframe belong to two different states, it is expected that, when the latitude and longitude coordinates are plotted on a map, they will form two distinct clusters.

In [92]:
plt.title('Localization of the real estates', fontsize=12)
plt.scatter(df['Latitude'], df['Longitude'], c=df['SellPrice'], alpha=0.2, marker="1")
plt.xlabel('Latitude', fontsize=12)
plt.ylabel('Longitude', fontsize=12)
plt.show()

As can be seen from the previous plot, the real estates in the considered sample are split into two groups:

  • one with latitude coordinates $\in [25, 30]$ and longitude coordinates around the value of $-80$
  • one with latitude coordinates $\in [45, 50]$ and longitude coordinates around the value of $-121$

Abstract maps

In [93]:
plt.figure(figsize=(20, 20))
plt.title('Abstract reconstruction of the map: Sold Houses Regions by price: Latitude in [45, 50] and Longitude apx -121')
plt.scatter(x=df[df['Latitude'] > 30]['Latitude'],
            y=df[df['Latitude'] > 30]['Longitude'], 
            c=df[df['Latitude'] > 30]['SellPrice'],
            alpha=0.7)
plt.xlabel('Latitude', fontsize=12); plt.ylabel('Longitude', fontsize=12)
plt.show();
In [94]:
plt.figure(figsize=(20, 20))
plt.title('Abstract reconstruction of the map: Sold Houses Regions by price: Latitude in [25, 30] and Longitude apx -80')
plt.scatter(x=df[df['Latitude'] < 30]['Latitude'],
            y=df[df['Latitude'] < 30]['Longitude'], 
            c=df[df['Latitude'] < 30]['SellPrice'], 
            alpha=0.8)
plt.xlabel('Latitude', fontsize=12); plt.ylabel('Longitude', fontsize=12)
plt.show();
In [95]:
fig, ax = plt.subplots(1,2,figsize=(30,25))

plt.subplot(2,2,1)
plt.title('Location for RCON/RSFR houses: \n Latitude in [25, 30] and Longitude apx -80', fontsize=10)
plt.scatter(x=df[df['Latitude'] < 30]['Latitude'],
            y=df[df['Latitude'] < 30]['Longitude'], 
            c=df[df['Latitude'] < 30]['isRCON'],
            alpha=0.4)
plt.xlabel('Latitude', fontsize=12)
plt.ylabel('Longitude', fontsize=12)


plt.subplot(2,2,2)
plt.title('Location for RCON/RSFR houses: \n Latitude in [45, 50] and Longitude apx -121', fontsize=10)
plt.scatter(x=df[df['Latitude'] > 30]['Latitude'],
            y=df[df['Latitude'] > 30]['Longitude'], 
            c=df[df['Latitude'] > 30]['isRCON'], 
            alpha=0.4)
plt.xlabel('Latitude', fontsize=12)
plt.ylabel('Longitude', fontsize=12)
plt.show();

As can be seen from the representations above, RCON and RSFR properties are mixed in both states. It is difficult to spot a large area or surface of land in which only RCON or only RSFR properties are sold.

In [96]:
base_map = folium.Map(location=[26.15, -80.25], control_scale=True, zoom_start=9.4)

df_copy = df[df['Latitude'] < 30].copy()
df_copy['Count'] = 1 # to see how many houses have been sold in a specific area 

# add CartoDB Positron tile layer
folium.TileLayer('cartodbpositron').add_to(base_map)

# add heatmap
HeatMap(data=df_copy[['Latitude','Longitude','Count']].groupby(['Latitude','Longitude']).sum().reset_index().values.tolist(),
        radius = 8, max_zoom = 13,name = 'Heat Map').add_to(base_map)

folium.LayerControl(collapsed=True).add_to(base_map)
base_map
Out[96]:
In [97]:
base_map = folium.Map(location=[47.5, -121.8], control_scale=True, zoom_start=9.4)

df_copy = df[df['Latitude'] > 30].copy()
df_copy['Count'] = 1 # to see how many houses have been sold in a specific area 

# add CartoDB Positron tile layer
folium.TileLayer('cartodbpositron').add_to(base_map)

# add heatmap
HeatMap(data=df_copy[['Latitude','Longitude','Count']].groupby(['Latitude','Longitude']).sum().reset_index().values.tolist(),
        radius = 8, max_zoom = 13,name = 'Heat Map').add_to(base_map)

folium.LayerControl(collapsed=True).add_to(base_map)
base_map
Out[97]:

The maps above confirm that the real estates in the dataframe are grouped into two clusters: one situated on the East coast of the USA, near Miami, and another on the North-West coast. The number of real estates sold is larger in the center of each cluster than in the suburbs.

In [98]:
len(df['Zonning'].value_counts().index)
Out[98]:
571
In [99]:
df['Zonning'].head()
Out[99]:
0         
1         
2    RM-18
3         
4     R-4C
Name: Zonning, dtype: object

The Zonning feature appears to represent a code specific to the property's location.

3D visualizations

In [100]:
df_copy = df[df['Latitude'] < 30].copy()

fig = px.scatter_3d(df_copy, 
                    x='Latitude',
                    y='Longitude', 
                    z='SellPrice',
                    color='SellPrice', 
                    opacity=0.9, 
                    size='SellPrice', 
                    title='3d price distribution based on geographical coordinates Latitude < 30')
fig.show()
In [101]:
def get_3d_graph_popular_zones_for_selling(df, color = '#5e384b'):
    df_copy = df.copy()
    df_copy['Count'] = 1
    df_copy['Latitude_rounded'] = round(df_copy['Latitude'],2)
    df_copy['Longitude_rounded'] = round(df_copy['Longitude'], 2)

    no_houses = df_copy[['Latitude_rounded','Longitude_rounded','Count']]\
                        .groupby(['Latitude_rounded','Longitude_rounded'])\
                        .sum().reset_index().values.tolist()
    no_houses = [tuple(current) for current in no_houses]
    no_houses = list(zip(*no_houses))   

    dict_occ = {str(elem): 0 for elem in set(list(zip(no_houses[0], no_houses[1])))}
    for i in range(len(no_houses[0])):
        key = str(tuple([no_houses[0][i], no_houses[1][i]]))
        dict_occ[key] = int(no_houses[2][i])

    z = list(dict_occ.values())
    x = []
    y = []

    for key in dict_occ.keys():
        key_tuple = eval(key)
        x.append(key_tuple[0])
        y.append(key_tuple[1])
    
    fig = plt.figure(figsize=(15, 10))
    ax1 = fig.add_subplot(111, projection='3d')

    num_elements = len(x)
    dx = [0.007 for i in range(num_elements)]
    dy = [0.007 for i in range(num_elements)]
    dz = z

    ax1.bar3d(x=x, y=y, z=[0 for i in range(num_elements)], dx=dx, dy=dy, dz=z, color=color)
    plt.title('Zones with the biggest number of houses for selling')
    ax1.set_xlabel('Latitude rounded')
    ax1.set_ylabel('Longitude rounded')
    ax1.set_zlabel('Number of houses')
    plt.show()
        
get_3d_graph_popular_zones_for_selling(df[df['Latitude'] < 30].copy())
In [102]:
get_3d_graph_popular_zones_for_selling(df[(df['Latitude'] > 30) & (df['Longitude'] < -121.8)].copy(), color = '#d59c8f')
In [103]:
columns = features_grouped_dict['code_related']
value_counts = pd.DataFrame({"Predictor\'s Name": columns, 
              "Number of different values RCON": [len(list(df[df['isRCON']==1][column].value_counts())) for column in columns],
              "Number of different values RSFR": [len(list(df[df['isRCON']==0][column].value_counts())) for column in columns]})
value_counts
Out[103]:
Predictor's Name Number of different values RCON Number of different values RSFR
0 CountyFipsCode 2 2
1 BuildingCode 5 4
2 ConditionCode 5 6
3 QualityCode 7 10
4 GarageCarportCode 18 6
5 PatioPorchCode 4 4
6 PoolCode 2 2
7 ConstructionCode 4 4
8 FoundationCode 2 3
9 ExteriorCode 3 7
10 CoolingCode 4 4
11 HeatingCode 11 12
12 HeatingSourceCode 6 6
13 DistressCode 9 9
14 Condition 5 5
In [104]:
h = df[features_grouped_dict['code_related']].hist(bins=15, figsize=(9,15),
                                                   xlabelsize='10',
                                                   ylabelsize='10',
                                                   color='#74d2b3')
sns.despine(left=True, bottom=True)
[x.title.set_size(10) for x in h.ravel()];
[x.yaxis.tick_left() for x in h.ravel()];
In [105]:
features_grouped_dict['dimension_related']
Out[105]:
['LandSqft',
 'LivingSqft',
 'GarageSqft',
 'BasementSqft',
 'BasementFinishedSqft']

Price per living sqft analysis

In [106]:
df['SellPricePerLivingSqft'] = df['SellPrice'] / df['LivingSqft']
features_grouped_dict['price_related_regression'].append('SellPricePerLivingSqft')
df['SellPricePerLivingSqft'].head()
Out[106]:
0     33.898305
1    125.000000
2     94.545455
3    148.666667
4     68.400000
Name: SellPricePerLivingSqft, dtype: float64
In [107]:
fig, ax = plt.subplots(1,1,figsize=(7,7))
plt.scatter(df[df['SellPricePerLivingSqft']< 6000]['SellPricePerLivingSqft'],
            df[df['SellPricePerLivingSqft']< 6000]['SellPrice'], 
            c=df[df['SellPricePerLivingSqft']< 6000]['SellPricePerLivingSqft'], 
            alpha=0.2, marker="1")
plt.xlabel('SellPricePerLivingSqft', fontsize=12)
plt.ylabel('SellPrice', fontsize=12)
plt.show()
In [108]:
pearsonr(df['SellPricePerLivingSqft'], df['SellPrice'])
Out[108]:
(0.3497243346650059, 0.0)

There is also a slight tendency of positive correlation between the 'SellPrice' of a house and the 'SellPricePerLivingSqft' variable, which is intuitive: if the selling price is larger, 'SellPricePerLivingSqft' will most likely be larger as well.

In [109]:
aux = features_grouped_dict['dimension_related']
aux.append('SellPricePerLivingSqft')
In [110]:
sns.pairplot(df[aux], kind="scatter", palette='husl')
plt.show()

There are slight tendencies of positive correlation between groups of variables such as 'BasementSqft' and 'BasementFinishedSqft', or 'BasementSqft' and 'GarageSqft'. It can also be noticed that when 'LandSqft' is larger, 'SellPricePerLivingSqft' also tends to increase.

In [111]:
fig, ax = plt.subplots(2,2,figsize=(15,15))

continuous_variables = ['LandSqft', 'LivingSqft','GarageSqft','BasementSqft']

for i in range(len(continuous_variables)):
    plt.subplot(3,2,i+1)
    plt.scatter(df[continuous_variables[i]], df['SellPrice'], c=df['SellPrice'], alpha=0.2, marker="1")
    plt.xlabel(continuous_variables[i], fontsize=12)
    plt.ylabel('SellPrice', fontsize=12)
plt.show()
In [112]:
plt.boxplot(df['LivingSqft'])
plt.show()

It seems that most of the real estates have a LivingSqft smaller than 5000. The remaining observations are outliers; if they represent only a small percentage of the entire dataset, I am going to drop them.
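This drop-only-if-rare rule can be wrapped in a small reusable helper; a sketch, where the function name, the toy frame, and the percentage cutoffs are all illustrative:

```python
import pandas as pd

def drop_rare_outliers(frame, column, threshold, max_pct=1.0):
    """Drop rows where `column` exceeds `threshold`, but only if those
    rows make up at most `max_pct` percent of the frame."""
    mask = frame[column] > threshold
    pct = mask.mean() * 100  # share of rows above the threshold, in percent
    if pct <= max_pct:
        return frame[~mask].copy()
    return frame  # too many rows would be lost; keep them all

toy = pd.DataFrame({'LivingSqft': [1200, 1500, 1800, 2100, 90000]})
trimmed = drop_rare_outliers(toy, 'LivingSqft', threshold=5000, max_pct=25.0)
print(len(trimmed))  # the single extreme row is dropped
```

The same helper could then be reused for 'LandSqft' below instead of repeating the percentage computation inline.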

In [113]:
str(len(list(df[df['LivingSqft']>5000]['LivingSqft'].index)) / len(df['LivingSqft']) * 100) + '% from the entire df'
Out[113]:
'0.6381812263358975% from the entire df'
In [114]:
df = df.drop(df[df['LivingSqft']>5000].index)
In [115]:
def plot_distribution_sqft(df, column_name):
    fig, ax = plt.subplots(1,2,figsize=(15,15))

    plt.subplot(3,2,1)
    plt.boxplot(df[column_name])
    plt.title(column_name + ' boxplot')

    plt.subplot(3,2,2)
    sns.distplot(df[column_name], hist = False, kde = True,
                     kde_kws = {'shade': True, 'linewidth': 3}, 
                     label = 'Distribution RCON + RSFR')
    sns.distplot(df[df['isRCON'] == 1][column_name], hist = False, kde = True,
                     kde_kws = {'shade': True, 'linewidth': 3}, 
                     label = 'Distribution RCON')
    sns.distplot(df[df['isRCON'] == 0][column_name], hist = False, kde = True,
                     kde_kws = {'shade': True, 'linewidth': 3}, 
                     label = 'Distribution RSFR')
    plt.title('Distribution plot ' + column_name)
    plt.show()
    
plot_distribution_sqft(df, 'LivingSqft')
In [116]:
plt.boxplot(df['LandSqft'])
plt.show()
In [117]:
str(len(list(df[df['LandSqft']>500000]['LandSqft'].index)) / len(df['LandSqft']) * 100) + '% from the entire df'
Out[117]:
'0.755621404915461% from the entire df'

Similarly, the 'LandSqft' outliers represent only a small percentage of the dataset, so they are dropped as well.

In [118]:
df = df.drop(df[df['LandSqft']>500000].index)
In [119]:
plot_distribution_sqft(df, 'LandSqft')

Features that represent a number of elements from the house

In [120]:
df[features_grouped_dict['number_of']].head()
Out[120]:
StructureNbr Bedrooms TotalBaths FirePlaces Stories UnitsInBuilding
0 1 2 2.0 0 0 1
1 1 1 1.5 0 0 1
2 1 2 1.0 0 0 1
3 1 1 1.5 0 0 1
4 1 2 2.0 0 0 1
In [121]:
plt.figure(figsize=(7, 4))
sns.distplot(df['TotalBaths'])
plt.title('Baths distribution')
plt.show()
In [122]:
df[df['SellPricePerLivingSqft']<4000].boxplot(column = ['SellPricePerLivingSqft'], by='Bedrooms', figsize=(7,10))
plt.show()
In [123]:
df[df['SellPricePerLivingSqft']<4000].boxplot(column = ['SellPrice'], by='Bedrooms', figsize=(7,10))
plt.show()
In [124]:
categorical_columns = ['Condition', 'Quality', 'HasPatioPorch', 'HasPool']
fig, ax = plt.subplots(4,1,figsize=(13,9))

for i in range(len(categorical_columns)):
    plt.subplot(2,2,i+1)
    sns.boxplot(df[categorical_columns[i]], df['SellPrice'])

plt.show()

Principal components analysis

Principal Components Analysis is a dimension-reduction technique. Its purpose is to reduce redundancy in data: datasets are often gathered with 20, 30, or more variables, yet they may actually measure fewer underlying factors (often called "latent traits") than the number of measurements.

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components.

The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.
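The variance-ordering property described above can be seen on a tiny synthetic example: two strongly correlated columns collapse almost entirely onto the first component (a sketch on invented data, not the notebook's dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x = rng.normal(size=300)
# second column is a noisy near-duplicate of the first
data = np.column_stack([x, x + rng.normal(scale=0.1, size=300)])

pca = PCA(n_components=2)
pca.fit(data)
print(pca.explained_variance_ratio_)  # the first component dominates
```

On the houses data below, the spread of `explained_variance_ratio_` across 44 components plays the same role: it shows how much of the redundancy PCA manages to concentrate in the leading components.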

In [125]:
char_type_columns = df.select_dtypes(include=object).columns.tolist()
columns_that_dont_need_standardisation = char_type_columns.copy()
columns_that_dont_need_standardisation.append('SellPrice')
In [126]:
features = [x for x in list(df) if x not in columns_that_dont_need_standardisation]
In [127]:
features
Out[127]:
['CountyFipsCode',
 'BuildingCode',
 'StructureNbr',
 'LandSqft',
 'LivingSqft',
 'GarageSqft',
 'BasementSqft',
 'BasementFinishedSqft',
 'Bedrooms',
 'TotalBaths',
 'FirePlaces',
 'YearBuilt',
 'ConditionCode',
 'QualityCode',
 'HasPatioPorch',
 'PatioPorchCode',
 'HasPool',
 'LandValue',
 'ImprovementValue',
 'TotalValue',
 'AssessedYear',
 'PropTaxAmount',
 'Zip',
 'Latitude',
 'Longitude',
 'ConstructionCode',
 'Stories',
 'UnitsInBuilding',
 'FoundationCode',
 'ExteriorCode',
 'CoolingCode',
 'HeatingSourceCode',
 'IsWaterfront',
 'View',
 'ViewScore',
 'LastSalePrice',
 'ArmsLengthFlag',
 'OwnerOccupied',
 'DistrsdProp',
 'isRCON',
 'SellDate_Year',
 'SellDate_Month',
 'SellDate_Day',
 'SellPricePerLivingSqft']

Data standardization - for numerical features only ($z = \frac{x - \overline{x}}{\mathrm{sd}(x)}$)
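As a quick check of what `stats.zscore` computes: it applies the formula above column-wise, so every standardized column ends up with mean 0 and (population) standard deviation 1. A sketch on synthetic data:

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(1)
data = rng.normal(loc=50, scale=7, size=(1000, 3))

z = stats.zscore(data)  # column-wise (x - mean) / std, with ddof=0 by default
print(np.allclose(z.mean(axis=0), 0), np.allclose(z.std(axis=0), 1))
```

This matters for PCA below: without standardization, features measured on large scales (e.g. square footage) would dominate the principal components purely because of their units.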

In [128]:
X = df[features].copy()
y = df['SellPrice']

# In general a good idea is to scale the data
X = stats.zscore(X)
In [129]:
normalised_df = pd.DataFrame(X,columns=features)
normalised_df.tail()
Out[129]:
CountyFipsCode BuildingCode StructureNbr LandSqft LivingSqft GarageSqft BasementSqft BasementFinishedSqft Bedrooms TotalBaths FirePlaces YearBuilt ConditionCode QualityCode HasPatioPorch PatioPorchCode HasPool LandValue ImprovementValue TotalValue AssessedYear PropTaxAmount Zip Latitude Longitude ConstructionCode Stories UnitsInBuilding FoundationCode ExteriorCode CoolingCode HeatingSourceCode IsWaterfront View ViewScore LastSalePrice ArmsLengthFlag OwnerOccupied DistrsdProp isRCON SellDate_Year SellDate_Month SellDate_Day SellPricePerLivingSqft
111488 0.913967 0.209481 -0.063776 -0.278334 0.897888 1.074112 -0.463366 -0.400482 0.331710 1.058777 1.043804 1.527802 0.14428 1.231053 1.535354 0.813040 -0.356483 0.548352 1.011278 0.891012 0.026961 -0.779975 0.914567 0.928452 -0.904719 -0.905365 -0.207936 -0.208295 -0.60044 -0.564159 -0.532963 0.641118 -0.04712 -0.379525 -0.675156 0.049625 0.215751 0.534616 -0.265711 -0.605257 0.503851 -1.733934 -0.879780 0.095133
111489 0.913967 0.209481 -0.063776 1.445681 0.395663 -0.619269 -0.463366 -0.400482 0.331710 0.075163 -0.958034 0.734817 0.14428 -0.388776 -0.651315 -0.565882 -0.356483 -0.034923 0.487901 0.254678 0.026961 0.360248 0.915092 0.905013 -0.926234 -0.905365 0.399802 -0.208295 -0.60044 -0.564159 -0.532963 2.143196 -0.04712 -0.379525 -0.675156 0.028936 0.215751 0.534616 -0.265711 -0.605257 -1.757016 1.320611 -0.773826 0.087419
111490 0.913967 0.209481 -0.063776 -0.239661 0.656559 1.271016 -0.463366 -0.400482 0.331710 1.058777 1.043804 0.678175 0.14428 1.231053 1.535354 0.813040 -0.356483 0.411729 0.858627 0.725012 0.026961 0.550783 0.916790 0.927890 -0.917006 -0.905365 0.399802 -0.208295 -0.60044 -0.564159 -0.532963 0.641118 -0.04712 -0.379525 -0.675156 -0.022519 0.215751 0.534616 -0.265711 -0.605257 -1.491031 0.404247 -0.773826 -0.307075
111491 0.913967 0.209481 -0.063776 -0.237936 0.056497 1.152874 -0.463366 -0.400482 0.331710 1.058777 1.043804 0.508250 0.14428 -0.388776 1.535354 0.813040 -0.356483 0.133228 -0.438911 -0.169545 0.026961 -0.019329 0.914660 0.908013 -0.911556 -0.905365 0.399802 -0.208295 -0.60044 -0.564159 -0.532963 0.641118 -0.04712 0.214344 2.194036 -0.031649 0.215751 0.534616 -0.265711 -0.605257 -1.092055 -0.817571 1.239318 -0.301025
111492 0.913967 0.209481 9.134900 0.436417 0.630469 -0.619269 3.997539 3.443202 1.873647 2.042391 1.043804 -1.021078 0.14428 -0.388776 1.535354 0.813040 -0.356483 -0.229349 1.387455 0.648160 0.026961 1.398633 0.912961 0.891887 -0.916911 -0.905365 -0.207936 -0.208295 -0.60044 -0.564159 -0.532963 0.641118 -0.04712 -0.379525 0.281241 0.054445 0.215751 -1.870500 -0.265711 -0.605257 0.769836 0.098793 -0.561916 0.193881
In [130]:
pca = PCA()
pca_houses = pca.fit_transform(X)
In [131]:
pca_houses.shape[1]
Out[131]:
44
In [132]:
principal_houses_df = pd.DataFrame(data = pca_houses, 
                                   columns = ['PC' + str(i) for i in range(1, len(list(normalised_df))+1, 1)])
principal_houses_df['SellPrice'] = list(df['SellPrice'])
principal_houses_df.tail()
Out[132]:
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11 PC12 PC13 PC14 PC15 PC16 PC17 PC18 PC19 PC20 PC21 PC22 PC23 PC24 PC25 PC26 PC27 PC28 PC29 PC30 PC31 PC32 PC33 PC34 PC35 PC36 PC37 PC38 PC39 PC40 PC41 PC42 PC43 PC44 SellPrice
111488 2.810782 1.102795 0.444455 -0.922556 -2.440003 0.696207 0.202314 0.380523 0.253830 -0.400277 -0.222402 1.252516 -0.046897 -1.383859 -0.517342 -0.013662 0.178541 0.362003 -0.732308 0.384578 0.460138 -0.006412 -0.350824 0.516919 0.312214 0.590586 0.683551 0.115905 0.026933 0.139622 -0.070582 -0.368586 -0.122270 0.154070 0.438038 1.369332 0.270148 0.123618 -0.148363 -0.049934 -0.001877 -0.013209 0.004102 -5.343505e-08 464020.000000
111489 1.817438 -0.462983 -0.558584 0.226159 -0.610678 -0.902180 1.308779 1.633002 0.227897 1.004865 0.283750 -1.117855 0.223593 -0.023496 0.099321 -0.201430 -0.535091 -0.044965 -0.102466 1.214098 -0.231973 0.241502 -0.365661 -1.010977 -1.360439 1.280533 -0.751322 -0.655214 0.594632 0.626644 1.405128 -0.519635 -0.276558 0.041248 0.341240 -0.048903 0.044372 -0.428578 -0.052118 0.044815 -0.001985 0.009169 -0.007431 -1.410829e-07 382558.096841
111490 3.097648 1.295066 0.145939 -0.595652 -2.148121 0.047948 1.518957 0.692859 0.325574 0.359898 -0.115551 -0.545013 -0.028714 -0.693699 -0.173345 0.492022 0.030859 0.127968 0.392033 0.625977 0.052266 -0.289076 0.205918 0.510793 0.058050 0.072153 0.457367 0.044531 0.248355 0.249192 0.021883 -0.261002 0.192541 0.085101 -0.229310 0.235566 -0.136488 0.112787 -0.199067 -0.104725 0.004581 -0.008307 -0.004951 -7.422748e-08 179950.000000
111491 3.054244 -0.320013 -0.372380 0.627053 -1.974749 0.431472 0.198391 -0.290336 -0.027486 -0.868028 0.253952 0.697108 -0.052400 0.034724 0.130692 -0.081959 0.049021 0.201495 0.709011 -0.887192 0.031701 -0.954625 1.447650 -0.727404 0.358773 0.170052 -0.670698 0.247804 0.959484 -0.351942 -0.512427 -0.076739 0.233968 -0.055606 -0.473130 -0.014197 -0.215436 0.219628 -0.215281 -0.112903 -0.022106 -0.000241 -0.000177 -1.870012e-07 144000.000000
111492 4.552779 1.828691 -0.494991 -0.936268 2.003300 -2.224797 0.263969 -5.134246 -4.376292 1.864911 -1.318731 0.490638 0.305536 1.725668 -0.520492 -0.820607 5.801103 0.469155 -2.092560 2.571280 -0.346670 0.058747 -0.324204 -1.397272 -0.722479 0.165721 1.214532 -0.800346 0.347536 0.025933 -0.746920 -0.212777 0.214301 0.217370 -0.607092 0.115140 -0.489389 0.307889 0.136581 -0.207394 0.134951 0.018049 0.001947 -1.291045e-07 483000.000000
In [133]:
print('Explained variation per principal component: {}'.format(pca.explained_variance_ratio_))
Explained variation per principal component: [2.54184373e-01 1.14045033e-01 6.87143136e-02 5.33753764e-02
 4.21079062e-02 3.34032176e-02 3.02739289e-02 2.97879005e-02
 2.50381794e-02 2.35111886e-02 2.32078860e-02 2.28761948e-02
 2.26642144e-02 2.14090980e-02 2.09197769e-02 2.05186757e-02
 1.96544266e-02 1.88675903e-02 1.77722499e-02 1.63405168e-02
 1.51985305e-02 1.47904645e-02 1.31750263e-02 1.04511546e-02
 8.69697014e-03 7.66157063e-03 7.06561945e-03 6.19253436e-03
 5.62001122e-03 5.36746752e-03 4.26235457e-03 3.76801471e-03
 3.47472066e-03 3.32665763e-03 2.65118958e-03 2.32683425e-03
 2.00714531e-03 1.91814251e-03 1.82861750e-03 1.40738201e-03
 1.35311655e-04 1.99547220e-06 2.38349638e-07 2.05357544e-14]
In [134]:
plt.figure(figsize=(10,4))
plt.bar([i for i in range(1,45, 1)], height=pca.explained_variance_ratio_, color='#c5a84e')
plt.title('Percentage of variance explained - PCA')
plt.show()
In [135]:
fig = plt.figure(figsize=(15,15))
plt.scatter(pca_houses[:,0], pca_houses[:,1], c = y)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
In [136]:
def biplot(score,coeff,ids=[1, 2], labels=None):
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]
    scalex = 1.0/(xs.max() - xs.min())
    scaley = 1.0/(ys.max() - ys.min())
    
    fig = plt.figure(figsize=(25,25))
    
    plt.scatter(xs * scalex,ys * scaley, c = y)
    for i in range(n):
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')

    plt.xlim(-0.4, 0.5)
    plt.ylim(-0.35,0.55)
    plt.xlabel("PrincipalComponent{}".format(ids[0]))
    plt.ylabel("PrincipalComponent{}".format(ids[1]))
    plt.grid()
    plt.show()

biplot(pca_houses[:,:2], np.transpose(pca.components_[:2, :]),ids=[1, 2], labels=features)
In [137]:
biplot(pca_houses[:,1:3], np.transpose(pca.components_[1:3, :]),ids=[2, 3], labels=features)
In [138]:
ax = plt.figure(figsize=(13,13)).add_subplot(projection='3d')

ax.scatter(
    xs=principal_houses_df['PC1'], 
    ys=principal_houses_df['PC2'], 
    zs=principal_houses_df['PC3'], 
    c=principal_houses_df['SellPrice'], 
    cmap='tab10'
)

ax.set_xlabel('pca-one')
ax.set_ylabel('pca-two')
ax.set_zlabel('pca-three')
plt.show()

Non-linear mappings in $\mathbb{R}^2$: tSNE

In [139]:
from sklearn.manifold import TSNE

tsne = TSNE()
X_embedded = tsne.fit_transform(X[:10000])

fig = plt.figure(figsize=(15, 15))
sns.scatterplot(X_embedded[:,0], X_embedded[:,1], hue=y[:10000], size=y[:10000], legend='brief', palette="Set2")
plt.show()

Projection pursuit

Export the processed dataframe as csv

In [140]:
df.to_csv('processed_df.csv')
In [141]:
normalised_df.to_csv('normalised_df.csv')
In [142]:
# code in the R notebook

It is clear that, when predicting the SellPrice of a house, it is necessary to look for features with rich predictive power.

In this notebook it was noticed that features related to the property tax amount and to the sell price per living sqft are correlated with the current sell price. Predictors related to quality or to dimensions, or ones counting elements of the house, can increase the accuracy of a regression model, which is quite intuitive.

The geographical location can also be an important aspect. The code-related features will be used as a replacement for the categorical / string features when building a regression model; this avoids having to transform the categorical columns into numerical ones through other techniques.

From a classification point of view, the significant features are largely the same, with the note that the RSFR houses are more expensive than the RCON ones.